Showing posts with label Gabe's Musings.

Wednesday, October 21, 2020

Countdown to AI Experience EMEA Virtual Conference

The AI Experience EMEA Virtual Conference is just three weeks away, and Team DataRobot couldn’t be more excited about what our speakers have in store for our virtual audience. Conference attendees will walk away with pragmatic ideas for how to accelerate time to impact and value for AI, and practical strategies to implement those ideas within their organizations — all at no cost to attend. This day-long event is an experience that no AI visionary will want to miss. 

We’re pleased to announce two keynote speakers that conference attendees can look forward to hearing from on November 10th. 

Brian Prestidge: Director of Insights & Decision Technology at Manchester City Football Club


With 15 years working in professional football in various roles supporting elite performance practitioners, Brian has seen the technological developments that have created exponential growth in the data available to football clubs. Whether through the use of AI & simulation for player development or the application of robust data science methods to support coaches in their game preparation, Brian & his team play a key role at City Football Group in enabling the club to make better and faster decisions in the very dynamic and heavily scrutinised environment of professional football.

Dr. Hannah Fry: Associate Professor in the Mathematics of Cities, Centre for Advanced Spatial Analysis at University College London 


Dr. Hannah Fry is an Associate Professor in the Mathematics of Cities at the Centre for Advanced Spatial Analysis at UCL where she studies patterns in human behavior. Her research applies to a wide range of social problems and questions, from shopping and transport to urban crime, riots and terrorism.

Her critically acclaimed BBC documentaries include Horizon: Diagnosis on Demand? The Computer Will See You Now, Britain’s Greatest Invention, City in the Sky (BBC Two), Magic Numbers: Hannah Fry’s Mysterious World of Maths, The Joy of Winning, The Joy of Data, Contagion! The BBC Four Pandemic and Calculating Ada (BBC Four). She also co-presents The Curious Cases of Rutherford and Fry (BBC Radio 4) and The Maths of Life with Lauren Laverne (BBC Radio 6).

Hannah is the author of Hello World, published in 2018.


We hope you'll join us on November 10th to hear these keynotes, and our full lineup of powerhouse speakers, share their insights on impactful, trustworthy AI. Leaders from Bayer Pharmaceuticals, Deutsche Post DHL Group, Medical Faculty Mannheim, Heidelberg University, and more will help you understand how to leverage AI to address hyper-critical issues impacting your organization.

Virtual Event
AI Experience EMEA Virtual Conference: Accelerating Impact With Trusted AI

The post Countdown to AI Experience EMEA Virtual Conference appeared first on DataRobot.



from Blog – DataRobot https://ift.tt/3kiPGkz
via Gabe's Musings

Monday, October 19, 2020

AI in Financial Markets, Part 5 of 4: Flippantly Answered Questions

This was going to be a four-part blog series. I figured that I’d covered most of what financial markets participants might be interested in hearing about when it comes to AI in the markets (and had probably bored the audience for long enough, all things considered). Then we did a webinar on the topic together with FactSet, had a great turnout, and got asked lots of great questions from our listeners — so many, in fact, that we had to answer them by email as we ran out of time.  And the questions were sufficiently good that we thought you might be interested in reading the Q&A, too, so here goes.

Financial markets are known to have an extremely non-stable structure. How do you find the right balance within your model between having a model that reacts quickly using mostly recent data (to have a better fit) vs. having a more stable model that can use longer (hence, more) data?

In one word: empirically. Remember that in financial markets, in particular, and time-series modeling in general, more history doesn’t automatically mean that you get better models. If a longer history means that you enter a different behavioral régime, your model performance will deteriorate. So trying out different lengths of the training period can be a valuable way of discovering how persistent a behavior actually is. Don’t go nuts, though. You need to avoid trying out so many variations that you end up with something that, by pure randomness, looks good but doesn’t generalize. You can avoid this kind of overfitting by being specific in your hypothesis formulation and rigorously testing your candidate models on multiple out-of-sample, out-of-time datasets; at all costs, avoid selection and look-ahead biases.
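To make "empirically" concrete, here is a minimal sketch of how one might compare training-window lengths against a fixed out-of-time test set. It uses pandas and scikit-learn purely for illustration (not DataRobot's internals), and assumes a hypothetical DataFrame df indexed by date, a list of feature columns, and a numeric "target" column.

    # Illustrative sketch only: compare training-window lengths on a fixed out-of-time test set.
    # Assumes a pandas DataFrame `df` indexed by date, a list `feature_cols`, and a "target" column.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_error

    def score_window(df, feature_cols, split_date, window_days):
        """Train on the last `window_days` before split_date, score on everything after it."""
        train = df.loc[(df.index >= split_date - pd.Timedelta(days=window_days)) &
                       (df.index < split_date)]
        test = df.loc[df.index >= split_date]
        model = GradientBoostingRegressor().fit(train[feature_cols], train["target"])
        return mean_absolute_error(test["target"], model.predict(test[feature_cols]))

    # Keep the list of candidate windows short, so you don't overfit the choice of window itself:
    # for window in (250, 500, 1000):
    #     print(window, score_window(df, feature_cols, pd.Timestamp("2019-01-01"), window))

Comparing a handful of windows this way, ideally on more than one out-of-time period, tells you how persistent the behavior you are modeling actually is.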

Are there particular areas of the markets where you see machine learning working better or worse than in others?

We see machine learning models adding value in lots of different areas of the markets and many different asset classes. As previously discussed, we see particular power in predicting second-order variables (i.e. modeling variables that influence returns, instead of directly predicting returns) and in a variety of use cases that form part of the business of financial markets (see blog post 1).

One interesting pattern that we have noticed is that generally, the longer the prediction horizon or data frequency, the more difficult it is to build good machine learning models.  So quarterly numbers tend to be very difficult to work with, and monthly numbers can be a struggle sometimes, too. With such infrequent data, there is a trade-off between using a very limited number of data points in order to not go beyond the current behavioral environment (limited data often makes it hard to build good models), or working with a longer history that spans multiple behavioral régimes (thus being a worse fit for the individual ones, so again it’s hard to build good models). On the other hand, there are a lot of great use cases using tick and market microstructure data, which is a bit like a firehose: if you need some more data to work with, you just need to open the tap for a little while.

Another pattern we see is not a surprise: the less efficient the price discovery mechanism is, the more value that machine learning models can add—as long as there’s enough training data. So machine learning models probably won’t add much value on the U.S. Treasury Bond future, nor will they on a corporate bond that trades once every six months: the sweet spot will be somewhere in between.

One other thing that’s worth mentioning: simulation-type problems can be a bit of a tricky fit for supervised machine learning, as there’s often no clear concept of a “target variable” to predict. That said, you can use machine learning models to make the predictions which enter the simulations, instead of linear or parametric models; this generally doesn’t make the process any faster, but allows the simulator to take advantage of non-linearity, which can help. In some cases, machine learning can also be used to predict simulation outcomes as a function of the simulation input parameters; this can, for instance, make it much faster to price certain exotic derivatives.
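As a rough illustration of that last idea, here is a toy surrogate-model sketch in scikit-learn: run the expensive simulator once to build a table of (input parameters, simulated output), then train a model on that table and query it instead of the simulator. The simulator function below is a made-up stand-in, not a real pricing model.

    # Toy sketch: learn a fast surrogate for an expensive simulation.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)

    def slow_simulator(params):
        """Stand-in for an expensive simulation run (illustrative toy function only)."""
        return params[:, 0] * np.exp(-params[:, 1]) + 0.1 * params[:, 2] ** 2

    # Run the expensive simulator once to collect training examples (inputs -> outputs).
    sim_params = rng.uniform(0.0, 1.0, size=(5000, 3))
    sim_outputs = slow_simulator(sim_params)

    # The surrogate learns the mapping and can then be queried far more cheaply
    # than re-running the full simulation for every new parameter set.
    surrogate = RandomForestRegressor(n_estimators=200).fit(sim_params, sim_outputs)
    approx_outputs = surrogate.predict(rng.uniform(0.0, 1.0, size=(10, 3)))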

You mentioned the difficulties that certain machine learning algorithms have with extrapolating beyond the bounds of the training data. If need be, can you focus on those algorithms that are able to extrapolate in DataRobot?

Yes, very easily — DataRobot’s leaderboard ranks the different candidate blueprints, models and algorithms by their out-of-sample performance. If you don’t want to use a model based on, say, decision trees, you would simply select a model trained using another algorithm family. The leaderboard comparison will show you whether there’s a trade-off between that model and the tree-based models in terms of out-of-sample performance.

As a reminder: even if an algorithm is able to make predictions that go beyond the limits of the training data, those predictions won't necessarily make any sense, as you have no evidence that the behaviors seen inside the training data still hold in environments beyond it. Proceed with extreme caution!
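A toy example (scikit-learn, not DataRobot-specific) makes the extrapolation point tangible: a tree ensemble predicts an essentially constant value outside the range of its training data, while a linear model keeps extending the trend; whether either behavior is sensible depends entirely on whether the trend really continues out there.

    # Toy illustration of extrapolation behavior beyond the training range.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression

    X_train = np.linspace(0, 10, 200).reshape(-1, 1)
    y_train = 2.0 * X_train.ravel() + np.random.default_rng(0).normal(0, 0.5, 200)

    tree = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
    linear = LinearRegression().fit(X_train, y_train)

    X_new = np.array([[15.0], [20.0]])   # well beyond the training range of 0-10
    print(tree.predict(X_new))           # stays roughly flat near ~20, the largest trained value
    print(linear.predict(X_new))         # keeps extending the trend: roughly 30 and 40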

How would you handle scenarios where certain predictors are missing data for a few years in a longitudinal dataset, maybe because the data gathering did not begin for that predictor until recently?

First, I'd check whether the predictors actually add any value to the model by building a few different variants: one trained on the full dataset including the variables with limited history (let's call this Model A), trained over the period for which the full data is available; another trained over the same period but excluding the limited-history variables (Model B); and a third, covering the full history, that also excludes the limited-history variables (Model B*). If Model A performs better than Model B, I probably would take that result and not investigate further; if it doesn't, the comparison between Model B and Model B* tells me whether adding further history actually helps model performance. If it does, and it's better than Model A, I'd look for longer-history proxies for the limited-history variables; if not, Model A is good to go.

If you’re referring to a scenario where you’re looking to backtest a strategy over a longer period of time and some of the data in your current model wouldn’t have been available in past periods, the solution is even simpler: evaluate a model built on just the longer history data for the period when the shorter history data isn’t available, then use a model built on the full dataset once it’s available.

So, tl;dr: try out different variants, empiricism wins again. Don’t go crazy with the different variants, as you don’t want to do the data science version of p-value hacking (quants will know this as the dreaded “data mining”). But comparing multiple models built in different ways usually gives good insights, especially when using DataRobot’s standardized analytics.  
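For what it's worth, here is a rough sketch of the Model A / Model B / Model B* comparison described above, written with scikit-learn for illustration; the DataFrame, column lists and dates are hypothetical placeholders for your own data.

    # Illustrative sketch of the Model A / B / B* comparison (all names are placeholders).
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_error

    def fit_and_score(train, test, feature_cols):
        """Train on one slice of history, report error on a common out-of-time holdout."""
        model = GradientBoostingRegressor().fit(train[feature_cols], train["target"])
        return mean_absolute_error(test["target"], model.predict(test[feature_cols]))

    def compare_variants(df, long_history_cols, short_history_cols, short_start, holdout_start):
        test = df.loc[df.index >= holdout_start]
        recent = df.loc[(df.index >= short_start) & (df.index < holdout_start)]
        full_history = df.loc[df.index < holdout_start]
        return {
            "A: recent period, all features": fit_and_score(recent, test, long_history_cols + short_history_cols),
            "B: recent period, long-history features only": fit_and_score(recent, test, long_history_cols),
            "B*: full history, long-history features only": fit_and_score(full_history, test, long_history_cols),
        }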

We hear a lot about the hybrid approach in machine learning. What is it, and does DataRobot support it?

Generally, hybrid approaches in machine learning combine two or more different types of algorithms in order to reduce model error and potentially solve problems which the individual algorithms would be less suited to, or less performant at. DataRobot has quite a few blueprints (machine learning pipelines) which use such approaches, typically combining a supervised machine learning algorithm (one that is designed to predict a target variable by learning from historical observations) with one or more unsupervised learning techniques (clustering, dimensionality reduction). We find that adding clustering, in particular, to a supervised machine learning algorithm like XGBoost can reduce prediction error by 10-15%, depending on the use case.
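A minimal sketch of that hybrid pattern, using scikit-learn on synthetic data (DataRobot's own blueprints are more involved, and the 10-15% figure is use-case dependent, so don't expect this toy to reproduce it):

    # Hybrid sketch: unsupervised cluster labels appended as an extra feature
    # for a supervised gradient-boosted model.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # Fit clusters on the inputs only (no target involved), then add the cluster id
    # as a new column; tree-based learners can split on the integer label directly.
    clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
    X_hybrid = np.column_stack([X, clusters])

    plain = cross_val_score(GradientBoostingClassifier(), X, y, cv=5).mean()
    hybrid = cross_val_score(GradientBoostingClassifier(), X_hybrid, y, cv=5).mean()
    print(f"plain: {plain:.3f}   with cluster feature: {hybrid:.3f}")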

How does the greedy search algorithm to populate DataRobot’s leaderboard work?

In a nutshell: we first identify the set of all the machine learning pipelines (“blueprints”) that can be applied to the problem at hand, then use a combination of heuristics (to ensure algorithm diversity) and recommendation (to identify those blueprints that are likely to be performant) to identify the initial algorithms. Multiple rounds of model training ensue, starting with a large spectrum of blueprints that are trained on small amounts of data, gradually reducing the number of blueprints trained (filtering on out-of-sample performance), while increasing the size of the training data, finally cross-validating the best-performing algorithms and trying out some ensembles to see whether this will further improve the performance.
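In pseudo-code terms, the general pattern is a successive-halving style tournament. The sketch below is a drastically simplified illustration of that idea, not DataRobot's actual implementation; `candidates` is a hypothetical dict mapping blueprint names to scikit-learn estimators, and X, y are numpy arrays.

    # Simplified successive-halving sketch: train everything on a little data,
    # keep the best performers, give the survivors more data, repeat.
    from sklearn.base import clone
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    def leaderboard_search(candidates, X, y, fractions=(0.16, 0.32, 0.64), keep=0.5):
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
        survivors = list(candidates.items())
        for frac in fractions:
            n = int(len(X_tr) * frac)
            scored = []
            for name, pipeline in survivors:
                model = clone(pipeline).fit(X_tr[:n], y_tr[:n])
                scored.append((accuracy_score(y_val, model.predict(X_val)), name, pipeline))
            scored.sort(key=lambda t: t[0], reverse=True)        # rank on out-of-sample score
            survivors = [(name, pipe) for _, name, pipe in scored[:max(1, int(len(scored) * keep))]]
        return scored                                            # final round = the "leaderboard"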

Please elaborate on the different types of feature extraction that DataRobot does.

DataRobot does four main kinds of feature extraction and selection automatically: 

  • Transforming features to match a particular machine learning algorithm or make it more performant (automated feature engineering), including dimensionality reduction using techniques such as principal-component analysis or singular value decomposition
  • Evaluating differences, ratios and other transformations and combinations in datasets where observations are independent (automated feature discovery)
  • Constructing rolling transformations and evaluating different lags in time series problems where autoregressiveness is present (time series feature engineering)
  • Automatically generating a reduced feature list on a modeling project’s best model and retraining it (automated feature selection) 

Additionally, users have the flexibility to build a wide range of feature transformations using the DataRobot Paxata data preparation platform before pushing the data to DataRobot MLdev.  The MLdev API also integrates seamlessly with Python and R’s powerful data preparation capabilities, as well as providing connectivity to other databases such as KDB.

What are the advantages of an enterprise solution like DataRobot compared to open platforms like scikit-learn or Tensorflow?

Cutting edge data science and machine learning are simply unthinkable without open-source packages such as Tensorflow; this is where the innovation lies these days. That said, DataRobot is not built by the crowd. We have some 350 incredibly talented engineers and data scientists on our team, whose job it is to engineer our platform to enterprise grade and work with our customers to ensure that it meets their needs. This includes a number of contributors to popular open-source libraries such as numpy, pandas, scikit-learn, keras, caret, pic2vec, urllib3 and many others.   

So we take the best of what’s out there in the open-source data science community and ensure that it’s suitable for enterprise use — contributing to the open source community itself when needed to make this happen.  For example, recently members of our team were building some modeling pipelines, including elements from an open-source machine learning library which we had not previously supported. Their testing revealed some critical bugs under the hood and development efforts were then refocused towards fixing the actual open-source library and pushing those changes out to the community.   

With a “best of both worlds” solution such as DataRobot, there’s still someone at the end of a phone to shout at if there’s an issue. And you don’t have to worry about making sure that all the parts of the open source stack are up-to-date either.

Does the DataRobot engine run on my desktop computer? How is performance managed, CPU vs GPU selection, etc?

DataRobot is a powerful platform whose requirements exceed the capabilities of a single desktop computer. There are various ways of running the DataRobot platform:

  • On DataRobot’s managed AI cloud 
  • Via the FactSet Workstation, with the backend running on FactSet's AWS cloud
  • Inside your enterprise’s firewall, on a Virtual Private Cloud such as Microsoft Azure, Amazon Web Services or Google Cloud
  • Inside your enterprise's firewall, on a data lake/cluster running Hadoop
  • Inside your enterprise’s firewall, on a bare-metal Linux cluster 

Performance is managed dynamically by the DataRobot app engine, with the user able to choose how much compute to apply to a modeling project by selecting the number of workers (each worker being able to train one machine learning model at a time). DataRobot runs entirely on CPUs; no expensive GPUs are needed.

Would you say that DataRobot’s learning curve is digestible for a portfolio manager or analyst, or is it targeted at in-house data analysts and quants who would live in the app?

I’d say that a certain amount of data literacy is important — I wouldn’t expect great results from giving DataRobot to an “old school” portfolio manager who struggles with Excel, for instance. We have two target user groups: first, people who understand the data well but aren’t quants or machine learning experts and want to be able to harness the power of machine learning without needing to get technical or learn how to code. We greatly automate the process with smart default settings and a variety of guardrails for this “democratization” audience. Through the process of using DataRobot and its built-in explainability and documentation, such users learn a lot about machine learning and how to frame machine learning problems, often quickly moving on to complex problem sets.

Our other target group is, of course, sophisticated quants and data scientists, who use DataRobot’s automation as a force multiplier for their productivity, by automating the boring, repetitive stuff where they don’t necessarily have an edge.

Is there a course designed around DataRobot to give us hands-on experience?

A wide variety of instructor-led and self-paced training programmes for different skill levels are available at https://university.datarobot.com/, with further resources and self-paced learning at DataRobot Community’s learning center: https://community.datarobot.com/t5/learning-center/ct-p/Learning 

There’s also the DataRobot free trial, details at:

In your demo, you built 72 different models to solve this binary classification problem. Some users may not have the expertise in machine learning to make the choice between models, and blindly using a model can be dangerous. What do you do to prevent giving a machine gun to a 3-year-old?

Great question.  It’s a combination of several things that work together. 

First, make sure that the machine gun has "safety mechanisms" such as built-in best practices and guardrails. For example, models are ranked strictly on their out-of-sample performance, in-sample performance data is never exposed, and each algorithm is combined with the appropriate data engineering in its machine learning pipeline.

Second, train the users in "gun safety."  This doesn't have to take that long — for instance, our Citizen Data Scientist Starter Quest takes an afternoon and is self-paced; our AutoML I course consists of three four-hour sessions — but it provides valuable context on how to frame machine learning problems and evaluate the models.

Third, make sure that the gun’s “scope” shows the users what they’re pointing at: provide users with sophisticated, standardized analytics that allow them to evaluate each model’s performance in-depth and understand the model’s drivers and how the model will respond in different scenarios.

And finally, support the users with experienced data scientists, a wealth of self-service content, and a growing online user community. (Sorry, ran out of gun metaphors.)

What, over 2,000 words and you still haven’t answered my question?

Hot damn.  Come on over to community.datarobot.com and we’ll do our best to answer it there.

Check out all of the blog series: part 1, part 2, part 3, part 4.

Webinar
Machine Learning for Quant Investing with DataRobot on FactSet

The post AI in Financial Markets, Part 5 of 4: Flippantly Answered Questions appeared first on DataRobot.



from Blog – DataRobot https://ift.tt/2IA7f1x
via Gabe's Musings

Tuesday, October 6, 2020

Tableau Conference-ish 2020

Unlock AI For BI

DataRobot is proud to once again be sponsoring Tableau Conference, or Tableau Conference-ish 2020 (TC20) as it has been dubbed this year for its virtual format. From October 6-8th, we invite you to learn more about the perfect union of AI and BI at our virtual booth. Swing by for helpful content, the opportunity to sign up for a free trial (tailored just for you), and some sought-after DataRobot swag.

TC20 will also provide the opportunity to hear and learn directly from joint DataRobot + Tableau customers during our session, "DataRobot: Unlock AI for BI." Hear from data leaders from the NHS, Alcon, and our host, Tableau, as they speak with me, Karin Jenson, Director of Business Analyst AI Success. Panelists will share how they use DataRobot with Tableau to leverage AI and automated machine learning to solve complex predictive problems easily. They will also share jaw-dropping use cases and talk about the upward career trajectory that incorporating AI and machine learning into business analytics has provided for them. Their stories are truly inspirational and not to be missed. Be sure to take a look at the TC20 agenda to take note of when the session episode airs for you locally.

After the session, join me and other members of the DataRobot team at our virtual booth, where you can sign up for the TC20-specific trial. Not only will the trial get you hands-on experience with DataRobot, but you’ll also get: 

  • Informative resources and self-paced content that will help you succeed in your predictive insights journey
  • Community support specifically for new users coming from Tableau Conference 2020
  • The opportunity to join the DataRobot for the Business Analyst advisory board. This exclusive invitation gives you access to DataRobot swag, beta sneak peeks, assistance from leading data scientists, and invitations to quarterly meetings where you can weigh in on what you need from DataRobot.

We look forward to connecting with you virtually at Tableau Conference-ish. Until then, please take a moment to get pumped up with a sneak peek at one of our commercials airing during the Super Bowl of analytics.

Virtual Event
Join us at TC20

October 6-8, 2020

The post Tableau Conference-ish 2020 appeared first on DataRobot.



from Blog – DataRobot https://ift.tt/34oXkDk
via Gabe's Musings

Friday, October 2, 2020

AI Experience EMEA Virtual Conference: Accelerating Impact With Trusted AI

In 2020, we're all pivoting business strategies, focusing on impact, and finding ways to "do it better and different" (coincidentally, a DataRobot core value). Organizations across industries have needed to become more resilient and focus on reducing costs and risks, retaining customers, and finding new revenue streams.

In June, we held our first-ever virtual AI conference, AI Experience Worldwide, to address these challenges. The conference was such a hit amongst attendees that we held another for our APAC audience. And now, we're coming for EMEA.

On November 10, 2020, join us for a full day of AI insights from our in-house experts and from industry leaders across a wide range of industries.

This event is designed to bring together AI visionaries, creators and consumers of AI, partners, customers, and decision makers to:

  • Learn how pragmatic, value-focused AI is making a measurable impact on organizations.
  • Understand how to address the hyper-critical issues impacting your organization today, throughout the rest of 2020 and into 2021. 
  • Hear AI-driven use cases from pioneers across a range of industries.
  • Network with industry peers facing the exact same challenges you are.

Trustworthy, impactful AI is an increasingly important priority for organizations. In the recently released 2020 Gartner Hype Cycle for Artificial Intelligence, algorithmic trust, democratized AI, and AI governance were prominently featured. As you plan for 2021, it's critical to derive the most value you can from your data.

Our goal is for you to walk away from the AI Experience EMEA Virtual Conference with actionable insights to take your organization to the next level of AI Success with trustworthy, impactful AI. Wherever you are on your AI journey, we’re committed to being your trusted partner every step of the way.

VIRTUAL CONFERENCE
AI Experience EMEA Virtual Conference: Accelerating Impact With Trusted AI

The post AI Experience EMEA Virtual Conference: Accelerating Impact With Trusted AI appeared first on DataRobot.



from Blog – DataRobot https://ift.tt/34doA7I
via Gabe's Musings

Tuesday, September 15, 2020

Seaborn Version 0.11.0 is here with displot, histplot and ecdfplot


Seaborn, one of the data visualization libraries in Python, has a new version, Seaborn 0.11, with a lot of new updates. One of the biggest changes is that Seaborn now has a beautiful logo. Jokes aside, the new version has a lot of new things to make data visualization better. This is a quick blog post covering a few of the Seaborn updates.

displot() for univariate and bivariate distributions

One of the big new changes is the "Modernization of distribution functions" in Seaborn version 0.11. The new version of Seaborn has three new functions, displot(), histplot() and ecdfplot(), that make visualizing distributions easier. Yes, we don't have to write our own function to make ECDF plots any more.

Seaborn's displot() can be used for visualizing both univariate and bivariate distributions. Among these three new functions, displot() gives a figure-level interface to the common distribution plots in Seaborn, including histograms (histplot), density plots (kdeplot), empirical distributions (ecdfplot), and rug plots. For example, we can use displot() and create:

  • histplot() with kind="hist" (this is the default)
  • kdeplot() (with kind=”kde”)
  • ecdfplot() (with kind=”ecdf”)
  • We can also add rugplot() to show the actual values of the data to any of these plots.

    Don't confuse displot() with the older distplot(): displot() is the new distplot() with better capabilities, and distplot() is deprecated starting from this Seaborn version.

    With the new displot() function in Seaborn, the plotting function hierarchy now looks like this, covering most of the plotting capabilities.

    Seaborn Plotting Functions Hierarchy


    In addition to catplot() for categorical variables and relplot() for relational plots, we now have displot() covering distributional plots.

    Let us get started trying out some of the functionalities. We can install the latest version of Seaborn

    pip install -U seaborn
    

    Let us load seaborn and make sure we have Seaborn version 0.11.

    import seaborn as sns
    import matplotlib.pyplot as plt  # needed for the figure and savefig calls below
    print(sns.__version__)
    0.11.0
    

    We will use the Palmer penguins dataset to illustrate some of the new functions and features of Seaborn. The penguins data is readily available as part of Seaborn, and we can load it using the load_dataset() function.

    penguins = sns.load_dataset("penguins")
    
    penguins.head()
            species island  bill_length_mm  bill_depth_mm   flipper_length_mm       body_mass_g     sex
    0       Adelie  Torgersen       39.1    18.7    181.0   3750.0  Male
    1       Adelie  Torgersen       39.5    17.4    186.0   3800.0  Female
    2       Adelie  Torgersen       40.3    18.0    195.0   3250.0  Female
    3       Adelie  Torgersen       NaN     NaN     NaN     NaN     NaN
    4       Adelie  Torgersen       36.7    19.3    193.0   3450.0  Female
    

    We can create histograms with Seaborn’s histplot() function, KDE plot with kdeplot() function, and ECDF plot with ecdfplot(). However, we primarily use displot() to illustrate Seaborn’s new capabilities.

    Histograms with Seaborn displot()

    Let us make a simple histogram with Seaborn’s displot() function.

    plt.figure(figsize=(10,8))
    sns.displot(penguins, 
                x="body_mass_g", 
                bins=25)
    plt.savefig("Seaborn_histogram_with_displot.png",
                        format='png',dpi=150)
    

    Here we have also specified the number of bins in the histogram.

    Seaborn histogram with displot()


    We can also color the histogram by a variable and create overlapping histograms.

    plt.figure(figsize=(10,8))
    sns.displot(penguins,
                x="body_mass_g", 
                hue="species", 
                bins=25)
    plt.savefig("Seaborn_overlapping_histogram_hue_with_displot.png",
                        format='png',dpi=150)
    

    In this example, we color penguins’ body mass by species.

    Seaborn displot(): overlapping histograms using hue

    Facetting with Seaborn displot()

    With the "col" argument we can create "small multiples", i.e. facet into multiple plots of the same type using subsets of the data based on a variable's value.

    plt.figure(figsize=(10,8))
    sns.displot(penguins, 
                x="body_mass_g",
                col="species", 
                bins=25)
    plt.savefig("Seaborn_facetting_histogram_col_with_displot.png",
                        format='png',dpi=150)
    

    Here, we have facetted by values of penguins’ species in our data set.

    Seaborn displot(): facetting histogram using col

    Density plot with Seaborn’s displot()

    Let us use displot() to create a density plot, using the kind="kde" argument. Here we also color by the species variable using the "hue" argument.

    plt.figure(figsize=(10,8))
    sns.displot(penguins,
                x="body_mass_g", 
                hue="species", 
                kind="kde")
    plt.savefig("Seaborn_kernel_density_plot_with_displot.png",
                        format='png',dpi=150)
    

    Seaborn displot(): kernel density plots

    Check out the Seaborn documentation; the new version has new ways to make density plots.

    ECDF Plot with Seaborn’s displot()

    One of the personal highlights of the Seaborn update is the availability of a function to make ECDF plots. The ECDF, aka the Empirical Cumulative Distribution Function, is a great alternative for visualizing distributions.

    In an ECDF plot, the x-axis corresponds to the range of data values for a variable, and on the y-axis we plot the proportion of data points (or counts) that are less than or equal to the corresponding x-axis value.

    Unlike histograms and density plots, an ECDF plot lets us visualize the data directly, without any smoothing parameters like the number of bins. Its usefulness is most visible when you have multiple distributions to compare.

    A potential disadvantage is that the relationship between the appearance of the plot and the basic properties of the distribution (such as its central tendency, variance, and the presence of any bimodality) may not be as intuitive.
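    To make the definition above concrete, here is a tiny sketch (using numpy, for illustration) of the curve that ecdfplot() draws for a single variable: sort the values and, for each one, plot the fraction of observations less than or equal to it.

    import numpy as np

    def ecdf(values):
        """Return x (sorted values) and y (fraction of points <= x): the curve ecdfplot() draws."""
        x = np.sort(np.asarray(values))
        y = np.arange(1, len(x) + 1) / len(x)
        return x, y

    # For example, ecdf(penguins["body_mass_g"].dropna()) gives the same curve that
    # sns.displot(penguins, x="body_mass_g", kind="ecdf") plots.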

    Let us make an ECDF plot using displot() with kind="ecdf". Here we make an ECDF plot of one variable and color it based on the values of another variable.

    plt.figure(figsize=(10,8))
    sns.displot(penguins, 
                x="body_mass_g", 
                hue="species",
                kind="ecdf")
    plt.savefig("Seaborn_ecdf_plot_with_displot.png",
                        format='png',dpi=150)
    

    Seaborn displot(): Empirical Cumulative Density Function (ECDF) Plot

    Bivariate KDE plot and Histogram with displot()

    With kdeplot(), we can also make bivariate density plots. In this example, we use displot() with kind="kde" to make a bivariate density/contour plot.

    plt.figure(figsize=(10,8))
    sns.displot(data=penguins, 
                x="body_mass_g", 
                y="bill_depth_mm", 
                kind="kde", 
                hue="species")
    plt.savefig("Seaborn_displot_bivariate_kde_contour.png",
                        format='png',dpi=150)
    

    Seaborn displot(): bivariate KDE Density plot

    We can also make a bivariate histogram with displot() using the kind="hist" option, or with histplot() directly.

    plt.figure(figsize=(10,8))
    sns.displot(data=penguins, 
                x="body_mass_g",
                y="bill_depth_mm",
                kind="hist", 
                hue="species")
    plt.savefig("Seaborn_displot_bivariate_hist.png",
                        format='png',dpi=150)
    

    Seaborn displot() Bivariate histogram

    New features to Seaborn jointplot()

    With Seaborn 0.11, jointplot() has also gained some nice features. Now jointplot() can take "hue" as an argument to color data points by a variable.

    sns.jointplot(data=penguins, 
                  x="body_mass_g", 
                  y="bill_depth_mm", 
                  hue="species")
    

    Seaborn jointplot color by variable using “hue”

    And jointplot() also gets a way to plot a bivariate histogram on the joint axes and univariate histograms on the marginal axes, using the kind="hist" argument to jointplot().

    sns.jointplot(data=penguins, 
                  x="body_mass_g", 
                  y="bill_depth_mm", 
                  hue="species", 
                  kind="hist")
    

    Seaborn jointplot color by variable: bivariate histogram

    Another big change that will help you write better data visualization code is that most Seaborn plotting functions will now require their parameters to be specified using keyword arguments. Otherwise, you will see a FutureWarning in v0.11.

    As part of the update, Seaborn's documentation has also been spruced up. Check out the new documentation on the data structures accepted by Seaborn plotting functions. Some functions can take data in both wide and long form. Currently, the distribution and relational plotting functions can handle both, and in future releases other Seaborn functions will accept the same data inputs.

    The post Seaborn Version 0.11.0 is here with displot, histplot and ecdfplot appeared first on Python and R Tips.



    from Python and R Tips https://ift.tt/2RscPUT
    via Gabe's Musings

    Monday, August 24, 2020

    Forrester Total Economic Impact Study of DataRobot: 514% ROI and Payback within 3 Months

    With AI proving to be the most transformative technology of our time, and companies today needing to pivot faster and drive near-term impact from their tech investments, many organizations are looking to drive higher ROI from AI as quickly as possible. From predicting customer churn to reducing fraud and avoiding supply chain disruptions, the possibilities are virtually limitless as to how AI can increase revenue, reduce costs, and improve profitability. But how can companies predict the expected value of an AI application or use case so they can justify new investments in the face of tight budgets and headcounts?

    To help answer this question, DataRobot today announced the results of a new study: The Total Economic Impact (TEI) of DataRobot. Conducted by Forrester Consulting on behalf of DataRobot, the commissioned study reveals that organizations using DataRobot's AI platform achieve a net present value (NPV) of $4 million and a 514% return on investment (ROI), with payback often in as little as 3 months.

    Forrester interviewed four customers with experience using DataRobot in the retail, healthcare, and manufacturing sectors as the basis for the report to help them better understand the benefits, costs, and risks associated with using the platform. These customers were looking to overcome key challenges, such as forecasting demand and tackling fraud. 

    Prior to using DataRobot, the customers relied on their data science teams to do the heavy lifting of data preparation, model development and training, and model deployment using traditional open-source technologies, such as the Python and R programming languages and their associated libraries and frameworks.

    Some customers were also hindered by their use of legacy data analysis technologies that failed to keep pace with the advancements in AI and machine learning over the past decade. This created environments with lengthy AI project timelines and frequently missed deadlines where organizations often never deployed and operationalized the models they developed.

    Forrester Consulting designed a composite organization based on the characteristics of the organizations interviewed. They then constructed a financial model representative of the interview findings, using the TEI methodology based on four fundamental elements: benefits, costs, flexibility, and risks. Forrester's TEI methodology serves to provide a complete picture of the total economic impact of purchase decisions.

    The results were remarkable, with firms reporting significant value relative to cost.

    • Cost savings from reduced seasonal overhiring in retail: $500,000
    • Cost savings from reduced healthcare fraud: $10 million
    • Increased revenue from improved demand forecasting in manufacturing: $50 million – $200 million
    • Significant cost savings from avoidance of hiring a data science team 3x as large

    Determining ROI of AI

    Many of our customers ask for help in estimating the value of AI when it’s being used to augment an existing process where some predictive analytics are already in place. The methodology in this report, drawing on data from 4 real-world customer deployments of AI, should provide a useful framework for anyone looking to justify an AI investment.
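    As a purely illustrative sketch of that framework, the core arithmetic is just discounted benefits versus costs; the figures below are made up for the example and are not from the Forrester study.

    # Purely illustrative ROI / NPV arithmetic; all figures are hypothetical.
    costs    = [400_000, 200_000, 200_000]       # year 0..2: platform + implementation costs
    benefits = [600_000, 1_500_000, 2_000_000]   # year 0..2: quantified benefits (e.g. fraud savings)
    rate = 0.10                                  # discount rate

    npv = sum((b - c) / (1 + rate) ** t for t, (b, c) in enumerate(zip(benefits, costs)))
    roi = (sum(benefits) - sum(costs)) / sum(costs)   # simple, undiscounted ROI
    print(f"NPV: {npv:,.0f}   ROI: {roi:.0%}")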

    Today’s business climate has never been so challenging, and organizations need agile trusted solutions that can steer intelligent decision-making in the face of market turbulence and continuously evolving customer needs. 

    DataRobot’s end-to-end enterprise AI platform addresses these demands. By preparing data, automating the training of machine learning models, operationalizing AI, and continuously monitoring AI assets, DataRobot enables organizations to enhance prediction accuracy, accelerate time to insight, reduce risk, and increase revenue – all without requiring heavy investment in data science teams. 

    Read the study to learn more.

    Study
    Forrester Total Economic Impact™ Study of DataRobot

    The post Forrester Total Economic Impact Study of DataRobot: 514% ROI and Payback within 3 Months appeared first on DataRobot.



    from Blog – DataRobot https://ift.tt/2FLCs0k
    via Gabe's Musings

    Saturday, August 22, 2020

    dplyr filter(): Filter/Select Rows based on conditions

    dplyr, the R package at the core of the tidyverse suite of packages, provides a great set of tools to manipulate datasets in tabular form. dplyr has a set of useful functions for "data munging", including select(), mutate(), summarise(), arrange(), and filter().

    And in this tidyverse tutorial, we will learn how to use dplyr's filter() function to select or filter rows from a data frame, with multiple examples. First, we will start with how to select rows of a dataframe based on the value of a single column or variable. And then we will learn how to select rows of a dataframe using values from multiple variables or columns.

    Let us get started by loading tidyverse, the suite of R packages from RStudio.

    library("tidyverse")
    

    We will load Penguins data directly from cmdlinetips.com‘s github page.

    path2data <- "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
    penguins<- readr::read_csv(path2data)
    

    Penguins data look like this

    head(penguins)
    ## # A tibble: 6 x 7
    ##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
    ##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
    ## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
    ## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
    ## 3 Adelie  Torge…           40.3          18                195        3250 fema…
    ## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
    ## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
    ## 6 Adelie  Torge…           39.3          20.6              190        3650 male
    

    Let us subset Penguins data by filtering rows based on one or more conditions.

    How to filter rows based on values of a single column in R?

    Let us learn how to filter a data frame based on the value of a single column. In this example, we want to subset the data such that we select rows whose "sex" column value is "female".

    penguins %>% 
      filter(sex=="female")
    

    This gives us a new dataframe, a tibble, containing rows whose sex column value is "female".

    ## # A tibble: 165 x 7
    ##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
    ##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
    ##  1 Adelie  Torge…           39.5          17.4              186        3800
    ##  2 Adelie  Torge…           40.3          18                195        3250
    ##  3 Adelie  Torge…           36.7          19.3              193        3450
    ##  4 Adelie  Torge…           38.9          17.8              181        3625
    ##  5 Adelie  Torge…           41.1          17.6              182        3200
    ##  6 Adelie  Torge…           36.6          17.8              185        3700
    ##  7 Adelie  Torge…           38.7          19                195        3450
    ##  8 Adelie  Torge…           34.4          18.4              184        3325
    ##  9 Adelie  Biscoe           37.8          18.3              174        3400
    ## 10 Adelie  Biscoe           35.9          19.2              189        3800
    ## # … with 155 more rows, and 1 more variable: sex <chr>
    
    

    In our first example, we used the pipe operator "%>%" to feed the data to dplyr's filter() function and select rows. Like other dplyr functions, we can also use filter() without the pipe operator, as shown below.

    filter(penguins, sex=="female")
    

    And we will get the same results as shown above.

    In the above example, we selected rows of a dataframe by checking a variable's value for equality. We can also use filter() to select rows by checking for inequality, or for values greater than or less than (or equal to) a variable's value.

    Let us see an example of filtering rows when a column’s value is not equal to “something”. In the example below, we filter dataframe whose species column values are not “Adelie”.

    penguins %>% 
      filter(species != "Adelie")
    

    We now get a filtered dataframe with species other than "Adelie".

    ## # A tibble: 192 x 7
    ##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
    ##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
    ##  1 Gentoo  Biscoe           46.1          13.2              211        4500
    ##  2 Gentoo  Biscoe           50            16.3              230        5700
    ##  3 Gentoo  Biscoe           48.7          14.1              210        4450
    ##  4 Gentoo  Biscoe           50            15.2              218        5700
    ##  5 Gentoo  Biscoe           47.6          14.5              215        5400
    ##  6 Gentoo  Biscoe           46.5          13.5              210        4550
    ##  7 Gentoo  Biscoe           45.4          14.6              211        4800
    ##  8 Gentoo  Biscoe           46.7          15.3              219        5200
    ##  9 Gentoo  Biscoe           43.3          13.4              209        4400
    ## 10 Gentoo  Biscoe           46.8          15.4              215        5150
    ## # … with 182 more rows, and 1 more variable: sex <chr>
    

    dplyr filter() with greater than condition

    When the column of interest is a numerical, we can select rows by using greater than condition. Let us see an example of filtering rows when a column’s value is greater than some specific value.

    In the example below, we filter the dataframe to select rows where body mass is greater than 6000, to see the heaviest penguins.

    # filter variable greater than a value
    penguins %>% 
      filter(body_mass_g> 6000)
    

    After filtering for body mass, we get just two rows that satisfy the body mass condition we provided.

    ## # A tibble: 2 x 7
    ##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
    ##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
    ## 1 Gentoo  Biscoe           49.2          15.2              221        6300 male 
    ## 2 Gentoo  Biscoe           59.6          17                230        6050 male
    

    Similarly, we can select or filter rows when a column’s value is less than some specific value.

    dplyr filter() with less than condition

    Similarly, we can also filter rows of a dataframe with less than condition. In this example below, we select rows whose flipper length column is less than 175.

    # filter variable less than a value
    penguins %>% 
      filter(flipper_length_mm <175)
    
    

    Here we get a new tibble with just rows satisfying our condition.

    ## # A tibble: 2 x 7
    ##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
    ##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
    ## 1 Adelie  Biscoe           37.8          18.3              174        3400 fema…
    ## 2 Adelie  Biscoe           37.9          18.6              172        3150 fema…
    

    How to Filter Rows of a dataframe using two conditions?

    With dplyr's filter() function, we can also specify more than one condition. In the example below, we have two conditions inside the filter() function: one specifies flipper length greater than 220, and the second is a condition on the sex column.

    # Boolean AND: both conditions must hold
    penguins %>% 
      filter(flipper_length_mm >220 & sex=="female")
    
    ## # A tibble: 1 x 7
    ##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
    ##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
    ## 1 Gentoo  Biscoe           46.9          14.6              222        4875 fema…
    
    

    dplyr’s filter() function with Boolean OR

    We can filter dataframe for rows satisfying one of the two conditions using Boolean OR. In this example, we select rows whose flipper length value is greater than 220 or bill depth is less than 10.

    penguins %>% 
      filter(flipper_length_mm >220 | bill_depth_mm < 10)
    
    ## # A tibble: 35 x 7
    ##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
    ##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
    ##  1 Gentoo  Biscoe           50            16.3              230        5700
    ##  2 Gentoo  Biscoe           49.2          15.2              221        6300
    ##  3 Gentoo  Biscoe           48.7          15.1              222        5350
    ##  4 Gentoo  Biscoe           47.3          15.3              222        5250
    ##  5 Gentoo  Biscoe           59.6          17                230        6050
    ##  6 Gentoo  Biscoe           49.6          16                225        5700
    ##  7 Gentoo  Biscoe           50.5          15.9              222        5550
    ##  8 Gentoo  Biscoe           50.5          15.9              225        5400
    ##  9 Gentoo  Biscoe           50.1          15                225        5000
    ## 10 Gentoo  Biscoe           50.4          15.3              224        5550
    ## # … with 25 more rows, and 1 more variable: sex <chr>
    

    Select rows with missing value in a column

    Often one might want to filter for, or filter out, rows where one of the columns has missing values. With is.na() on the column of interest, we can select rows where a specific column's value is missing.

    In this example, we select rows or filter rows with bill length column with missing values.

    penguins %>% 
     filter(is.na(bill_length_mm))
    

    In this dataset, there are only two rows with missing values in bill length column.

    ## # A tibble: 2 x 8
    ##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
    ##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
    ## 1 Adelie  Torge…             NA            NA               NA          NA <NA> 
    ## 2 Gentoo  Biscoe             NA            NA               NA          NA <NA> 
    ## # … with 1 more variable: year <int>
    

    We can also use the negation symbol "!" to reverse the selection. In this example, we select rows with no missing values for the sex column.

    penguins %>% 
      filter(!is.na(sex))
    

    Note that this filtering will keep rows that have missing values in other columns.

    ## # A tibble: 333 x 7
    ##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
    ##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
    ##  1 Adelie  Torge…           39.1          18.7              181        3750
    ##  2 Adelie  Torge…           39.5          17.4              186        3800
    ##  3 Adelie  Torge…           40.3          18                195        3250
    ##  4 Adelie  Torge…           36.7          19.3              193        3450
    ##  5 Adelie  Torge…           39.3          20.6              190        3650
    ##  6 Adelie  Torge…           38.9          17.8              181        3625
    ##  7 Adelie  Torge…           39.2          19.6              195        4675
    ##  8 Adelie  Torge…           41.1          17.6              182        3200
    ##  9 Adelie  Torge…           38.6          21.2              191        3800
    ## 10 Adelie  Torge…           34.6          21.1              198        4400
    ## # … with 323 more rows, and 1 more variable: sex <chr>
    

    The post dplyr filter(): Filter/Select Rows based on conditions appeared first on Python and R Tips.



    from Python and R Tips https://ift.tt/3ld5Ht4
    via Gabe's Musings

    Tuesday, August 11, 2020

    dplyr arrange(): Sort/Reorder by One or More Variables

    dplyr, the R package that is part of the tidyverse suite of packages, provides a great set of tools to manipulate datasets in tabular form. dplyr has a set of core functions for "data munging", including select(), mutate(), filter(), summarise(), and arrange().

    And in this tidyverse tutorial, we will learn how to use dplyr's arrange() function to sort a data frame in multiple ways. First, we will start with how to sort a dataframe by the values of a single variable, and then we will learn how to sort a dataframe by more than one variable. By default, dplyr's arrange() sorts in ascending order; we will also learn to sort in descending order.

    Let us get started by loading tidyverse, the suite of R packages from RStudio.

    library("tidyverse")
    

    We will use the fantastic penguins dataset to illustrate how to sort a dataframe. Let us load the data from cmdlinetips.com's github page.

    path2data <- "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
    penguins<- readr::read_csv(path2data)
    
    ## Parsed with column specification:
    ## cols(
    ##   species = col_character(),
    ##   island = col_character(),
    ##   bill_length_mm = col_double(),
    ##   bill_depth_mm = col_double(),
    ##   flipper_length_mm = col_double(),
    ##   body_mass_g = col_double(),
    ##   sex = col_character()
    ## )
    
    head(penguins)
    
    ## # A tibble: 6 x 7
    ##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
    ##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
    ## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
    ## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
    ## 3 Adelie  Torge…           40.3          18                195        3250 fema…
    ## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
    ## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
    ## 6 Adelie  Torge…           39.3          20.6              190        3650 male
    

    How To Sort a Dataframe by a single Variable with dplyr’s arrange()?

    We can use dplyr's arrange() function to sort a dataframe by one or more variables. Let us say we want to sort the penguins dataframe by body mass, to quickly find the lightest penguin and see how it relates to the other variables.

    We will use the pipe operator "%>%" to feed the data to the dplyr function arrange(). We need to specify the name of the variable we want to sort the dataframe by. In this example, we are sorting by the variable "body_mass_g".

    penguins %>% 
      arrange(body_mass_g)
    

    dplyr's arrange() sorts the dataframe by the variable and outputs a new dataframe (as a tibble). You can see that the resulting dataframe is different from the original: the body_mass_g column is now arranged from smallest to largest values.

    ## # A tibble: 344 x 7
    ##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
    ##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
    ##  1 Chinst… Dream            46.9          16.6              192        2700
    ##  2 Adelie  Biscoe           36.5          16.6              181        2850
    ##  3 Adelie  Biscoe           36.4          17.1              184        2850
    ##  4 Adelie  Biscoe           34.5          18.1              187        2900
    ##  5 Adelie  Dream            33.1          16.1              178        2900
    ##  6 Adelie  Torge…           38.6          17                188        2900
    ##  7 Chinst… Dream            43.2          16.6              187        2900
    ##  8 Adelie  Biscoe           37.9          18.6              193        2925
    ##  9 Adelie  Dream            37.5          18.9              179        2975
    ## 10 Adelie  Dream            37            16.9              185        3000
    ## # … with 334 more rows, and 1 more variable: sex <chr>
    

    How To Sort or Reorder Rows in Descending Order with dplyr’s arrange()?

    By default, dplyr's arrange() sorts in ascending order. We can sort by a variable in descending order using the desc() function on the variable we want to sort by. For example, to sort the dataframe by body_mass_g in descending order we use

    penguins %>%
     arrange(desc(body_mass_g))
    
    ## # A tibble: 344 x 7
    ##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
    ##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
    ##  1 Gentoo  Biscoe           49.2          15.2              221        6300
    ##  2 Gentoo  Biscoe           59.6          17                230        6050
    ##  3 Gentoo  Biscoe           51.1          16.3              220        6000
    ##  4 Gentoo  Biscoe           48.8          16.2              222        6000
    ##  5 Gentoo  Biscoe           45.2          16.4              223        5950
    ##  6 Gentoo  Biscoe           49.8          15.9              229        5950
    ##  7 Gentoo  Biscoe           48.4          14.6              213        5850
    ##  8 Gentoo  Biscoe           49.3          15.7              217        5850
    ##  9 Gentoo  Biscoe           55.1          16                230        5850
    ## 10 Gentoo  Biscoe           49.5          16.2              229        5800
    ## # … with 334 more rows, and 1 more variable: sex <chr>
    

    How To Sort a Dataframe by Two Variables?

    With dplyr's arrange() function we can sort by more than one variable. To sort or arrange by two variables, we specify the names of the two variables as arguments to the arrange() function, as shown below. Note that the order matters here.

    penguins %>% 
       arrange(body_mass_g,flipper_length_mm)
    

    In this example, we have body_mass_g first and flipper_length_mm second. dplyr's arrange() sorts by these two variables such that, for each value of the first variable, dplyr under the hood subsets the data and sorts it by the second variable.

    For example, we can see that starting from the second row, body_mass_g has tied values and flipper_length_mm is sorted in ascending order within those ties.

    
    ## # A tibble: 344 x 7
    ##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
    ##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
    ##  1 Chinst… Dream            46.9          16.6              192        2700
    ##  2 Adelie  Biscoe           36.5          16.6              181        2850
    ##  3 Adelie  Biscoe           36.4          17.1              184        2850
    ##  4 Adelie  Dream            33.1          16.1              178        2900
    ##  5 Adelie  Biscoe           34.5          18.1              187        2900
    ##  6 Chinst… Dream            43.2          16.6              187        2900
    ##  7 Adelie  Torge…           38.6          17                188        2900
    ##  8 Adelie  Biscoe           37.9          18.6              193        2925
    ##  9 Adelie  Dream            37.5          18.9              179        2975
    ## 10 Adelie  Dream            37            16.9              185        3000
    ## # … with 334 more rows, and 1 more variable: sex <chr>
    

    Notice the difference in results we get by changing the order of two variables we want to sort by. In the example below we have flipper_length first and body_mass next.

    penguins %>%
      arrange(flipper_length_mm,body_mass_g)
    

    Now our dataframe is first sorted by flipper_length and then by body_mass.

    ## # A tibble: 344 x 7
    ##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
    ##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
    ##  1 Adelie  Biscoe           37.9          18.6              172        3150
    ##  2 Adelie  Biscoe           37.8          18.3              174        3400
    ##  3 Adelie  Torge…           40.2          17                176        3450
    ##  4 Adelie  Dream            33.1          16.1              178        2900
    ##  5 Adelie  Dream            39.5          16.7              178        3250
    ##  6 Chinst… Dream            46.1          18.2              178        3250
    ##  7 Adelie  Dream            37.2          18.1              178        3900
    ##  8 Adelie  Dream            37.5          18.9              179        2975
    ##  9 Adelie  Dream            42.2          18.5              180        3550
    ## 10 Adelie  Biscoe           37.7          18.7              180        3600
    ## # … with 334 more rows, and 1 more variable: sex <chr>
    

    The post dplyr arrange(): Sort/Reorder by One or More Variables appeared first on Python and R Tips.



    from Python and R Tips https://ift.tt/3ivjkSs
    via Gabe's Musings