Showing posts with label Gabe's Musings.

Wednesday, October 21, 2020

Countdown to AI Experience EMEA Virtual Conference

The AI Experience EMEA Virtual Conference is just three weeks away, and Team DataRobot couldn’t be more excited about what our speakers have in store for our virtual audience. Conference attendees will walk away with pragmatic ideas for how to accelerate time to impact and value for AI, and practical strategies to implement those ideas within their organizations — all at no cost to attend. This day-long event is an experience that no AI visionary will want to miss. 

We’re pleased to announce two keynote speakers that conference attendees can look forward to hearing from on November 10th. 

Brian Prestidge: Director of Insights & Decision Technology at Manchester City Football Club


With 15 years working in professional football in various roles supporting elite performance practitioners, Brian has seen the technological developments that have created exponential growth in the data available to football clubs. Whether through the use of AI & simulation for player development or the application of robust data science methods to support coaches in their game preparation, Brian & his team play a key role at City Football Group in enabling the club to make better and faster decisions in the very dynamic and heavily scrutinised environment of professional football.

Dr. Hannah Fry: Associate Professor in the Mathematics of Cities, Centre for Advanced Spatial Analysis at University College London 


Dr. Hannah Fry is an Associate Professor in the Mathematics of Cities at the Centre for Advanced Spatial Analysis at UCL where she studies patterns in human behavior. Her research applies to a wide range of social problems and questions, from shopping and transport to urban crime, riots and terrorism.

Her critically acclaimed BBC documentaries include Horizon: Diagnosis on Demand? The Computer Will See You Now, Britain’s Greatest Invention, City in the Sky (BBC Two), Magic Numbers: Hannah Fry’s Mysterious World of Maths, The Joy of Winning, The Joy of Data, Contagion! The BBC Four Pandemic and Calculating Ada (BBC Four). She also co-presents The Curious Cases of Rutherford and Fry (BBC Radio 4) and The Maths of Life with Lauren Laverne (BBC Radio 6).

Hannah is the author of Hello World, published in 2018.


We hope you'll join us on November 10th to hear these keynotes, and our full lineup of powerhouse speakers, share their insights on impactful, trustworthy AI. Leaders from Bayer Pharmaceuticals, Deutsche Post DHL Group, Medical Faculty Mannheim, Heidelberg University, and more will help you understand how to leverage AI to address hyper-critical issues impacting your organization.

Virtual Event
AI Experience EMEA Virtual Conference: Accelerating Impact With Trusted AI

The post Countdown to AI Experience EMEA Virtual Conference appeared first on DataRobot.



from Blog – DataRobot https://ift.tt/3kiPGkz
via Gabe's Musings

Monday, October 19, 2020

AI in Financial Markets, Part 5 of 4: Flippantly Answered Questions

This was going to be a four-part blog series. I figured that I’d covered most of what financial markets participants might be interested in hearing about when it comes to AI in the markets (and had probably bored the audience for long enough, all things considered). Then we did a webinar on the topic together with FactSet, had a great turnout, and got asked lots of great questions from our listeners — so many, in fact, that we had to answer them by email as we ran out of time.  And the questions were sufficiently good that we thought you might be interested in reading the Q&A, too, so here goes.

Financial markets are known to have an extremely non-stable structure. How do you find the right balance within your model between having a model that reacts quickly using mostly recent data (to have a better fit) vs. having a more stable model that can use longer (hence, more) data?

In one word: empirically. Remember that in financial markets, in particular, and time-series modeling in general, more history doesn’t automatically mean that you get better models. If a longer history means that you enter a different behavioral régime, your model performance will deteriorate. So trying out different lengths of the training period can be a valuable way of discovering how persistent a behavior actually is. Don’t go nuts, though. You need to avoid trying out so many variations that you end up with something that, by pure randomness, looks good but doesn’t generalize. You can avoid this kind of overfitting by being specific in your hypothesis formulation and rigorously testing your candidate models on multiple out-of-sample, out-of-time datasets; at all costs, avoid selection and look-ahead biases.
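To make "empirically" concrete, here is a minimal sketch of how one might compare training-window lengths against a fixed out-of-time test set. It uses pandas and scikit-learn purely for illustration (not DataRobot's internals), and assumes a hypothetical DataFrame df indexed by date, a list of feature columns, and a numeric "target" column.

    # Illustrative sketch only: compare training-window lengths on a fixed out-of-time test set.
    # Assumes a pandas DataFrame `df` indexed by date, a list `feature_cols`, and a "target" column.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_error

    def score_window(df, feature_cols, split_date, window_days):
        """Train on the last `window_days` before split_date, score on everything after it."""
        train = df.loc[(df.index >= split_date - pd.Timedelta(days=window_days)) &
                       (df.index < split_date)]
        test = df.loc[df.index >= split_date]
        model = GradientBoostingRegressor().fit(train[feature_cols], train["target"])
        return mean_absolute_error(test["target"], model.predict(test[feature_cols]))

    # Keep the list of candidate windows short, so you don't overfit the choice of window itself:
    # for window in (250, 500, 1000):
    #     print(window, score_window(df, feature_cols, pd.Timestamp("2019-01-01"), window))

Comparing a handful of windows this way, ideally on more than one out-of-time period, tells you how persistent the behavior you are modeling actually is.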

Are there particular areas of the markets where you see machine learning working better or worse than in others?

We see machine learning models adding value in lots of different areas of the markets and many different asset classes. As previously discussed, we see particular power in predicting second-order variables (i.e. modeling variables that influence returns, instead of directly predicting returns) and in a variety of use cases that form part of the business of financial markets (see blog post 1).

One interesting pattern that we have noticed is that generally, the longer the prediction horizon or data frequency, the more difficult it is to build good machine learning models.  So quarterly numbers tend to be very difficult to work with, and monthly numbers can be a struggle sometimes, too. With such infrequent data, there is a trade-off between using a very limited number of data points in order to not go beyond the current behavioral environment (limited data often makes it hard to build good models), or working with a longer history that spans multiple behavioral régimes (thus being a worse fit for the individual ones, so again it’s hard to build good models). On the other hand, there are a lot of great use cases using tick and market microstructure data, which is a bit like a firehose: if you need some more data to work with, you just need to open the tap for a little while.

Another pattern we see is not a surprise: the less efficient the price discovery mechanism is, the more value that machine learning models can add—as long as there’s enough training data. So machine learning models probably won’t add much value on the U.S. Treasury Bond future, nor will they on a corporate bond that trades once every six months: the sweet spot will be somewhere in between.

One other thing that’s worth mentioning: simulation-type problems can be a bit of a tricky fit for supervised machine learning, as there’s often no clear concept of a “target variable” to predict. That said, you can use machine learning models to make the predictions which enter the simulations, instead of linear or parametric models; this generally doesn’t make the process any faster, but allows the simulator to take advantage of non-linearity, which can help. In some cases, machine learning can also be used to predict simulation outcomes as a function of the simulation input parameters; this can, for instance, make it much faster to price certain exotic derivatives.
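As a rough illustration of that last idea, here is a toy surrogate-model sketch in scikit-learn: run the expensive simulator once to build a table of (input parameters, simulated output), then train a model on that table and query it instead of the simulator. The simulator function below is a made-up stand-in, not a real pricing model.

    # Toy sketch: learn a fast surrogate for an expensive simulation.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)

    def slow_simulator(params):
        """Stand-in for an expensive simulation run (illustrative toy function only)."""
        return params[:, 0] * np.exp(-params[:, 1]) + 0.1 * params[:, 2] ** 2

    # Run the expensive simulator once to collect training examples (inputs -> outputs).
    sim_params = rng.uniform(0.0, 1.0, size=(5000, 3))
    sim_outputs = slow_simulator(sim_params)

    # The surrogate learns the mapping and can then be queried far more cheaply
    # than re-running the full simulation for every new parameter set.
    surrogate = RandomForestRegressor(n_estimators=200).fit(sim_params, sim_outputs)
    approx_outputs = surrogate.predict(rng.uniform(0.0, 1.0, size=(10, 3)))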

You mentioned the difficulties that certain machine learning algorithms have with extrapolating beyond the bounds of the training data. If need be, can you focus on those algorithms that are able to extrapolate in DataRobot?

Yes, very easily — DataRobot’s leaderboard ranks the different candidate blueprints, models and algorithms by their out-of-sample performance. If you don’t want to use a model based on, say, decision trees, you would simply select a model trained using another algorithm family. The leaderboard comparison will show you whether there’s a trade-off between that model and the tree-based models in terms of out-of-sample performance.

As a reminder: even if an algorithm is able to make predictions that go beyond the limits of the training data, those predictions won't necessarily make any sense, as you have no evidence that the behaviors seen inside the training data still hold in environments beyond it. Proceed with extreme caution!
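A toy example (scikit-learn, not DataRobot-specific) makes the extrapolation point tangible: a tree ensemble predicts an essentially constant value outside the range of its training data, while a linear model keeps extending the trend; whether either behavior is sensible depends entirely on whether the trend really continues out there.

    # Toy illustration of extrapolation behavior beyond the training range.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression

    X_train = np.linspace(0, 10, 200).reshape(-1, 1)
    y_train = 2.0 * X_train.ravel() + np.random.default_rng(0).normal(0, 0.5, 200)

    tree = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
    linear = LinearRegression().fit(X_train, y_train)

    X_new = np.array([[15.0], [20.0]])   # well beyond the training range of 0-10
    print(tree.predict(X_new))           # stays roughly flat near ~20, the largest trained value
    print(linear.predict(X_new))         # keeps extending the trend: roughly 30 and 40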

How would you handle scenarios where certain predictors are missing data for a few years in a longitudinal dataset, maybe because the data gathering did not begin for that predictor until recently?

First, I'd check whether the predictors actually add any value to the model by building a few different variants: one trained on the full dataset including the variables with limited history (let's call this Model A), trained over the period for which the full data is available; another trained over the same period but excluding the limited-history variables (Model B); and a third, covering the full history, that also excludes the limited-history variables (Model B*). If Model A performs better than Model B, I probably would take that result and not investigate further; if it doesn't, the comparison between Model B and Model B* tells me whether adding further history actually helps model performance. If it does, and it's better than Model A, I'd look for longer-history proxies for the limited-history variables; if not, Model A is good to go.

If you’re referring to a scenario where you’re looking to backtest a strategy over a longer period of time and some of the data in your current model wouldn’t have been available in past periods, the solution is even simpler: evaluate a model built on just the longer history data for the period when the shorter history data isn’t available, then use a model built on the full dataset once it’s available.

So, tl;dr: try out different variants, empiricism wins again. Don’t go crazy with the different variants, as you don’t want to do the data science version of p-value hacking (quants will know this as the dreaded “data mining”). But comparing multiple models built in different ways usually gives good insights, especially when using DataRobot’s standardized analytics.  
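For what it's worth, here is a rough sketch of the Model A / Model B / Model B* comparison described above, written with scikit-learn for illustration; the DataFrame, column lists and dates are hypothetical placeholders for your own data.

    # Illustrative sketch of the Model A / B / B* comparison (all names are placeholders).
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_error

    def fit_and_score(train, test, feature_cols):
        """Train on one slice of history, report error on a common out-of-time holdout."""
        model = GradientBoostingRegressor().fit(train[feature_cols], train["target"])
        return mean_absolute_error(test["target"], model.predict(test[feature_cols]))

    def compare_variants(df, long_history_cols, short_history_cols, short_start, holdout_start):
        test = df.loc[df.index >= holdout_start]
        recent = df.loc[(df.index >= short_start) & (df.index < holdout_start)]
        full_history = df.loc[df.index < holdout_start]
        return {
            "A: recent period, all features": fit_and_score(recent, test, long_history_cols + short_history_cols),
            "B: recent period, long-history features only": fit_and_score(recent, test, long_history_cols),
            "B*: full history, long-history features only": fit_and_score(full_history, test, long_history_cols),
        }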

We hear a lot about the hybrid approach in machine learning. What is it, and does DataRobot support it?

Generally, hybrid approaches in machine learning combine two or more different types of algorithms in order to reduce model error and potentially solve problems which the individual algorithms would be less suited to, or less performant at. DataRobot has quite a few blueprints (machine learning pipelines) which use such approaches, typically combining a supervised machine learning algorithm (one that is designed to predict a target variable by learning from historical observations) with one or more unsupervised learning techniques (clustering, dimensionality reduction). We find that adding clustering, in particular, to a supervised machine learning algorithm like XGBoost can reduce prediction error by 10-15%, depending on the use case.
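A minimal sketch of that hybrid pattern, using scikit-learn on synthetic data (DataRobot's own blueprints are more involved, and the 10-15% figure is use-case dependent, so don't expect this toy to reproduce it):

    # Hybrid sketch: unsupervised cluster labels appended as an extra feature
    # for a supervised gradient-boosted model.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # Fit clusters on the inputs only (no target involved), then add the cluster id
    # as a new column; tree-based learners can split on the integer label directly.
    clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
    X_hybrid = np.column_stack([X, clusters])

    plain = cross_val_score(GradientBoostingClassifier(), X, y, cv=5).mean()
    hybrid = cross_val_score(GradientBoostingClassifier(), X_hybrid, y, cv=5).mean()
    print(f"plain: {plain:.3f}   with cluster feature: {hybrid:.3f}")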

How does the greedy search algorithm to populate DataRobot’s leaderboard work?

In a nutshell: we first identify the set of all the machine learning pipelines (“blueprints”) that can be applied to the problem at hand, then use a combination of heuristics (to ensure algorithm diversity) and recommendation (to identify those blueprints that are likely to be performant) to identify the initial algorithms. Multiple rounds of model training ensue, starting with a large spectrum of blueprints that are trained on small amounts of data, gradually reducing the number of blueprints trained (filtering on out-of-sample performance), while increasing the size of the training data, finally cross-validating the best-performing algorithms and trying out some ensembles to see whether this will further improve the performance.
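In pseudo-code terms, the general pattern is a successive-halving style tournament. The sketch below is a drastically simplified illustration of that idea, not DataRobot's actual implementation; `candidates` is a hypothetical dict mapping blueprint names to scikit-learn estimators, and X, y are numpy arrays.

    # Simplified successive-halving sketch: train everything on a little data,
    # keep the best performers, give the survivors more data, repeat.
    from sklearn.base import clone
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    def leaderboard_search(candidates, X, y, fractions=(0.16, 0.32, 0.64), keep=0.5):
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
        survivors = list(candidates.items())
        for frac in fractions:
            n = int(len(X_tr) * frac)
            scored = []
            for name, pipeline in survivors:
                model = clone(pipeline).fit(X_tr[:n], y_tr[:n])
                scored.append((accuracy_score(y_val, model.predict(X_val)), name, pipeline))
            scored.sort(key=lambda t: t[0], reverse=True)        # rank on out-of-sample score
            survivors = [(name, pipe) for _, name, pipe in scored[:max(1, int(len(scored) * keep))]]
        return scored                                            # final round = the "leaderboard"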

Please elaborate on the different types of feature extraction that DataRobot does.

DataRobot does four main kinds of feature extraction and selection automatically: 

  • Transforming features to match a particular machine learning algorithm or make it more performant (automated feature engineering), including dimensionality reduction using techniques such as principal-component analysis or singular value decomposition
  • Evaluating differences, ratios and other transformations and combinations in datasets where observations are independent (automated feature discovery)
  • Constructing rolling transformations and evaluating different lags in time series problems where autoregressiveness is present (time series feature engineering)
  • Automatically generating a reduced feature list on a modeling project’s best model and retraining it (automated feature selection) 

Additionally, users have the flexibility to build a wide range of feature transformations using the DataRobot Paxata data preparation platform before pushing the data to DataRobot MLdev.  The MLdev API also integrates seamlessly with Python and R’s powerful data preparation capabilities, as well as providing connectivity to other databases such as KDB.

What are the advantages of an enterprise solution like DataRobot compared to open platforms like scikit-learn or Tensorflow?

Cutting edge data science and machine learning are simply unthinkable without open-source packages such as Tensorflow; this is where the innovation lies these days. That said, DataRobot is not built by the crowd. We have some 350 incredibly talented engineers and data scientists on our team, whose job it is to engineer our platform to enterprise grade and work with our customers to ensure that it meets their needs. This includes a number of contributors to popular open-source libraries such as numpy, pandas, scikit-learn, keras, caret, pic2vec, urllib3 and many others.   

So we take the best of what’s out there in the open-source data science community and ensure that it’s suitable for enterprise use — contributing to the open source community itself when needed to make this happen.  For example, recently members of our team were building some modeling pipelines, including elements from an open-source machine learning library which we had not previously supported. Their testing revealed some critical bugs under the hood and development efforts were then refocused towards fixing the actual open-source library and pushing those changes out to the community.   

With a “best of both worlds” solution such as DataRobot, there’s still someone at the end of a phone to shout at if there’s an issue. And you don’t have to worry about making sure that all the parts of the open source stack are up-to-date either.

Does the DataRobot engine run on my desktop computer? How is performance managed, CPU vs GPU selection, etc?

DataRobot is a powerful platform whose requirements exceed the capabilities of a single desktop computer. There are various ways of running the DataRobot platform:

  • On DataRobot’s managed AI cloud 
  • Via the FactSet Workstation, with the backend running on FactSet's AWS cloud
  • Inside your enterprise’s firewall, on a Virtual Private Cloud such as Microsoft Azure, Amazon Web Services or Google Cloud
  • Inside your enterprise's firewall, on a data lake/cluster running Hadoop
  • Inside your enterprise’s firewall, on a bare-metal Linux cluster 

Performance is managed dynamically by the DataRobot app engine, with the user able to choose how much compute to apply to a modeling project by selecting the number of workers (each worker being able to train one machine learning model at a time). DataRobot runs entirely on CPUs; no expensive GPUs are needed.

Would you say that DataRobot’s learning curve is digestible for a portfolio manager or analyst, or is it targeted at in-house data analysts and quants who would live in the app?

I’d say that a certain amount of data literacy is important — I wouldn’t expect great results from giving DataRobot to an “old school” portfolio manager who struggles with Excel, for instance. We have two target user groups: first, people who understand the data well but aren’t quants or machine learning experts and want to be able to harness the power of machine learning without needing to get technical or learn how to code. We greatly automate the process with smart default settings and a variety of guardrails for this “democratization” audience. Through the process of using DataRobot and its built-in explainability and documentation, such users learn a lot about machine learning and how to frame machine learning problems, often quickly moving on to complex problem sets.

Our other target group is, of course, sophisticated quants and data scientists, who use DataRobot’s automation as a force multiplier for their productivity, by automating the boring, repetitive stuff where they don’t necessarily have an edge.

Is there a course designed around DataRobot to give us hands-on experience?

A wide variety of instructor-led and self-paced training programmes for different skill levels are available at https://university.datarobot.com/, with further resources and self-paced learning at DataRobot Community’s learning center: https://community.datarobot.com/t5/learning-center/ct-p/Learning 

There’s also the DataRobot free trial, details at:

In your demo, you built 72 different models to solve this binary classification problem. Some users may not have the expertise in machine learning to make the choice between models, and blindly using a model can be dangerous. What do you do to prevent giving a machine gun to a 3-year-old?

Great question.  It’s a combination of several things that work together. 

First, make sure that the machine gun has "safety mechanisms" such as built-in best practices and guardrails. For example, models are ranked strictly on their out-of-sample performance, in-sample performance data is never exposed, and each algorithm is combined with the appropriate data engineering in its machine learning pipeline.

Second, train the users in "gun safety."  This doesn't have to take that long — for instance, our Citizen Data Scientist Starter Quest takes an afternoon and is self-paced; our AutoML I course consists of three four-hour sessions — but it provides valuable context on how to frame machine learning problems and evaluate the models.

Third, make sure that the gun’s “scope” shows the users what they’re pointing at: provide users with sophisticated, standardized analytics that allow them to evaluate each model’s performance in-depth and understand the model’s drivers and how the model will respond in different scenarios.

And finally, support the users with experienced data scientists, a wealth of self-service content, and a growing online user community. (Sorry, ran out of gun metaphors.)

What, over 2,000 words and you still haven’t answered my question?

Hot damn.  Come on over to community.datarobot.com and we’ll do our best to answer it there.

Check out all of the blog series: part 1, part 2, part 3, part 4.

Webinar
Machine Learning for Quant Investing with DataRobot on FactSet

The post AI in Financial Markets, Part 5 of 4: Flippantly Answered Questions appeared first on DataRobot.



from Blog – DataRobot https://ift.tt/2IA7f1x
via Gabe's Musings

Tuesday, October 6, 2020

Tableau Conference-ish 2020

Unlock AI For BI

DataRobot is proud to once again be sponsoring Tableau Conference, or Tableau Conference-ish 2020 (TC20) as it has been dubbed this year for its virtual format. From October 6-8th, we invite you to learn more about the perfect union of AI and BI at our virtual booth. Swing by for helpful content, the opportunity to sign up for a free trial (tailored just for you), and some sought-after DataRobot swag.

TC20 will also provide the opportunity to hear and learn directly from joint DataRobot + Tableau customers during our session, "DataRobot: Unlock AI for BI." Hear from data leaders from the NHS, Alcon, and our host, Tableau, as they speak with me, Karin Jenson, Director of Business Analyst AI Success. Panelists will share how they use DataRobot with Tableau to leverage AI and automated machine learning to solve complex predictive problems easily. They will also share jaw-dropping use cases and talk about the upward career trajectory that incorporating AI and machine learning into business analytics has provided for them. Their stories are truly inspirational and not to be missed. Be sure to take a look at the TC20 agenda to take note of when the session episode airs for you locally.

After the session, join me and other members of the DataRobot team at our virtual booth, where you can sign up for the TC20-specific trial. Not only will the trial get you hands-on experience with DataRobot, but you’ll also get: 

  • Informative resources and self-paced content that will help you succeed in your predictive insights journey
  • Community support specifically for new users coming from Tableau Conference 2020
  • The opportunity to join the DataRobot for the Business Analyst advisory board. This exclusive invitation gives you access to DataRobot swag, beta sneak peeks, assistance from leading data scientists, and invitations to quarterly meetings where you can weigh in on what you need from DataRobot.

We look forward to connecting with you virtually at Tableau Conference-ish. Until then, please take a moment to get pumped up with a sneak peek at one of our commercials airing during the Super Bowl of analytics.

Virtual Event
Join us at TC20

October 6-8, 2020

The post Tableau Conference-ish 2020 appeared first on DataRobot.



from Blog – DataRobot https://ift.tt/34oXkDk
via Gabe's Musings

Friday, October 2, 2020

AI Experience EMEA Virtual Conference: Accelerating Impact With Trusted AI

In 2020, we're all pivoting business strategies, focusing on impact, and finding ways to "do it better and different" (coincidentally, a DataRobot core value). Organizations across industries have needed to become more resilient and focus on reducing costs and risks, retaining customers, and finding new revenue streams.

In June, we held our first-ever virtual AI conference, AI Experience Worldwide, to address these challenges. The conference was such a hit amongst attendees that we held another for our APAC audience. And now, we're coming for EMEA.

On November 10, 2020, join us for a full day of AI insights from our in-house experts and from industry leaders across a wide range of industries.

This event is designed to bring together AI visionaries, creators and consumers of AI, partners, customers, and decision makers to:

  • Learn how pragmatic, value-focused AI is making a measurable impact on organizations.
  • Understand how to address the hyper-critical issues impacting your organization today, throughout the rest of 2020 and into 2021. 
  • Hear AI-driven use cases from pioneers across a range of industries.
  • Network with industry peers facing the exact same challenges you are.

Trustworthy, impactful AI is an increasingly important priority for organizations. In the recently released 2020 Gartner Hype Cycle for Artificial Intelligence, algorithmic trust, democratized AI, and AI governance were prominently featured. As you plan for 2021, it's critical to derive the most value you can from your data.

Our goal is for you to walk away from the AI Experience EMEA Virtual Conference with actionable insights to take your organization to the next level of AI Success with trustworthy, impactful AI. Wherever you are on your AI journey, we’re committed to being your trusted partner every step of the way.

VIRTUAL CONFERENCE
AI Experience EMEA Virtual Conference: Accelerating Impact With Trusted AI

The post AI Experience EMEA Virtual Conference: Accelerating Impact With Trusted AI appeared first on DataRobot.



from Blog – DataRobot https://ift.tt/34doA7I
via Gabe's Musings

Tuesday, September 15, 2020

Seaborn Version 0.11.0 is here with displot, histplot and ecdfplot


Seaborn, one of the data visualization libraries in Python, has a new version, Seaborn 0.11, with a lot of new updates. One of the biggest changes is that Seaborn now has a beautiful logo. Jokes aside, the new version has a lot of new things to make data visualization better. This is a quick blog post covering a few of the Seaborn updates.

displot() for univariate and bivariate distributions

One of the big new changes is the "Modernization of distribution functions" in Seaborn version 0.11. The new version of Seaborn has three new functions, displot(), histplot() and ecdfplot(), that make visualizing distributions easier. Yes, we don't have to write our own function to make ECDF plots any more.

Seaborn's displot() can be used for visualizing both univariate and bivariate distributions. Among these three new functions, displot() gives a figure-level interface to the common distribution plots in Seaborn, including histograms (histplot), density plots (kdeplot), empirical distributions (ecdfplot), and rug plots. For example, we can use displot() and create:

  • histplot() with kind="hist" (this is the default)
  • kdeplot() (with kind=”kde”)
  • ecdfplot() (with kind=”ecdf”)
  • We can also add rugplot() to show the actual values of the data to any of these plots.

    Don't confuse displot() with the older distplot(): displot() is the new distplot() with better capabilities, and distplot() is deprecated starting from this Seaborn version.

    With the new displot() function in Seaborn, the plotting function hierarchy now looks like this, covering most of the plotting capabilities.

    Seaborn Plotting Functions Hierarchy


    In addition to catplot() for categorical variables and relplot() for relational plots, we now have displot() covering distributional plots.

    Let us get started trying out some of the functionalities. We can install the latest version of Seaborn

    pip install -U seaborn
    

    Let us load seaborn and make sure we have Seaborn version 0.11.

    import seaborn as sns
    import matplotlib.pyplot as plt  # needed for the figure and savefig calls below
    print(sns.__version__)
    0.11.0
    

    We will use the Palmer penguins dataset to illustrate some of the new functions and features of Seaborn. The penguins data is readily available as part of Seaborn, and we can load it using the load_dataset() function.

    penguins = sns.load_dataset("penguins")
    
    penguins.head()
            species island  bill_length_mm  bill_depth_mm   flipper_length_mm       body_mass_g     sex
    0       Adelie  Torgersen       39.1    18.7    181.0   3750.0  Male
    1       Adelie  Torgersen       39.5    17.4    186.0   3800.0  Female
    2       Adelie  Torgersen       40.3    18.0    195.0   3250.0  Female
    3       Adelie  Torgersen       NaN     NaN     NaN     NaN     NaN
    4       Adelie  Torgersen       36.7    19.3    193.0   3450.0  Female
    

    We can create histograms with Seaborn’s histplot() function, KDE plot with kdeplot() function, and ECDF plot with ecdfplot(). However, we primarily use displot() to illustrate Seaborn’s new capabilities.

    Histograms with Seaborn displot()

    Let us make a simple histogram with Seaborn’s displot() function.

    plt.figure(figsize=(10,8))
    sns.displot(penguins, 
                x="body_mass_g", 
                bins=25)
    plt.savefig("Seaborn_histogram_with_displot.png",
                        format='png',dpi=150)
    

    Here we have also specified the number of bins in the histogram.

    Seaborn histogram with displot()


    We can also color the histogram by a variable and create overlapping histograms.

    plt.figure(figsize=(10,8))
    sns.displot(penguins,
                x="body_mass_g", 
                hue="species", 
                bins=25)
    plt.savefig("Seaborn_overlapping_histogram_hue_with_displot.png",
                        format='png',dpi=150)
    

    In this example, we color penguins’ body mass by species.

    Seaborn displot(): overlapping histograms using hue

    Facetting with Seaborn displot()

    With the "col" argument we can create "small multiples", i.e. facet into multiple plots of the same type using subsets of the data based on a variable's value.

    plt.figure(figsize=(10,8))
    sns.displot(penguins, 
                x="body_mass_g",
                col="species", 
                bins=25)
    plt.savefig("Seaborn_facetting_histogram_col_with_displot.png",
                        format='png',dpi=150)
    

    Here, we have facetted by values of penguins’ species in our data set.

    Seaborn displot(): facetting histogram using col

    Density plot with Seaborn’s displot()

    Let us use displot() to create a density plot, using the kind="kde" argument. Here we also color by the species variable using the "hue" argument.

    plt.figure(figsize=(10,8))
    sns.displot(penguins,
                x="body_mass_g", 
                hue="species", 
                kind="kde")
    plt.savefig("Seaborn_kernel_density_plot_with_displot.png",
                        format='png',dpi=150)
    

    Seaborn displot(): kernel density plots

    Check out the Seaborn documentation; the new version has new ways to make density plots.

    ECDF Plot with Seaborn’s displot()

    One of the personal highlights of the Seaborn update is the availability of a function to make ECDF plots. The ECDF, aka the Empirical Cumulative Distribution Function, is a great alternative for visualizing distributions.

    In an ECDF plot, the x-axis corresponds to the range of data values for a variable, and on the y-axis we plot the proportion of data points (or counts) that are less than or equal to the corresponding x-axis value.

    Unlike histograms and density plots, an ECDF plot lets us visualize the data directly, without any smoothing parameters like the number of bins. Its usefulness is most visible when you have multiple distributions to compare.

    A potential disadvantage is that the relationship between the appearance of the plot and the basic properties of the distribution (such as its central tendency, variance, and the presence of any bimodality) may not be as intuitive.
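    To make the definition above concrete, here is a tiny sketch (using numpy, for illustration) of the curve that ecdfplot() draws for a single variable: sort the values and, for each one, plot the fraction of observations less than or equal to it.

    import numpy as np

    def ecdf(values):
        """Return x (sorted values) and y (fraction of points <= x): the curve ecdfplot() draws."""
        x = np.sort(np.asarray(values))
        y = np.arange(1, len(x) + 1) / len(x)
        return x, y

    # For example, ecdf(penguins["body_mass_g"].dropna()) gives the same curve that
    # sns.displot(penguins, x="body_mass_g", kind="ecdf") plots.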

    Let us make an ECDF plot using displot() with kind="ecdf". Here we make an ECDF plot of one variable and color it based on the values of another variable.

    plt.figure(figsize=(10,8))
    sns.displot(penguins, 
                x="body_mass_g", 
                hue="species",
                kind="ecdf")
    plt.savefig("Seaborn_ecdf_plot_with_displot.png",
                        format='png',dpi=150)
    

    Seaborn displot(): Empirical Cumulative Density Function (ECDF) Plot

    Bivariate KDE plot and Histogram with displot()

    With kdeplot(), we can also make bivariate density plots. In this example, we use displot() with kind="kde" to make a bivariate density/contour plot.

    plt.figure(figsize=(10,8))
    sns.displot(data=penguins, 
                x="body_mass_g", 
                y="bill_depth_mm", 
                kind="kde", 
                hue="species")
    plt.savefig("Seaborn_displot_bivariate_kde_contour.png",
                        format='png',dpi=150)
    

    Seaborn displot(): bivariate KDE Density plot

    We can also make a bivariate histogram with displot() using the kind="hist" option, or with histplot() directly.

    plt.figure(figsize=(10,8))
    sns.displot(data=penguins, 
                x="body_mass_g",
                y="bill_depth_mm",
                kind="hist", 
                hue="species")
    plt.savefig("Seaborn_displot_bivariate_hist.png",
                        format='png',dpi=150)
    

    Seaborn displot() Bivariate histogram

    New features to Seaborn jointplot()

    With Seaborn 0.11, jointplot() has also gained some nice features. Now jointplot() can take "hue" as an argument to color data points by a variable.

    sns.jointplot(data=penguins, 
                  x="body_mass_g", 
                  y="bill_depth_mm", 
                  hue="species")
    

    Seaborn jointplot color by variable using “hue”

    And jointplot() also gets a way to plot a bivariate histogram on the joint axes and univariate histograms on the marginal axes, using the kind="hist" argument to jointplot().

    sns.jointplot(data=penguins, 
                  x="body_mass_g", 
                  y="bill_depth_mm", 
                  hue="species", 
                  kind="hist")
    

    Seaborn jointplot color by variable: bivariate histogram

    Another big change that will help you write better data visualization code is that most Seaborn plotting functions will now require their parameters to be specified using keyword arguments. Otherwise, you will see a FutureWarning in v0.11.

    As part of the update, Seaborn's documentation has also been spruced up. Check out the new documentation on the data structures accepted by Seaborn plotting functions. Some functions can take data in both wide and long form. Currently, the distribution and relational plotting functions can handle both, and in future releases other Seaborn functions will accept the same data inputs.

    The post Seaborn Version 0.11.0 is here with displot, histplot and ecdfplot appeared first on Python and R Tips.



    from Python and R Tips https://ift.tt/2RscPUT
    via Gabe's Musings

    Monday, August 24, 2020

    Forrester Total Economic Impact Study of DataRobot: 514% ROI and Payback within 3 Months

    With AI proving to be the most transformative technology of our time, and companies today needing to pivot faster and drive near-term impact from their tech investments, many organizations are looking to drive higher ROI from AI as quickly as possible. From predicting customer churn to reducing fraud and avoiding supply chain disruptions, the possibilities are virtually limitless as to how AI can increase revenue, reduce costs, and improve profitability. But how can companies predict the expected value of an AI application or use case so they can justify new investments in the face of tight budgets and headcounts?

    To help answer this question, DataRobot today announced the results of a new study: The Total Economic Impact (TEI) of DataRobot. Conducted by Forrester Consulting on behalf of DataRobot, the commissioned study reveals that organizations using DataRobot's AI platform achieve a net present value (NPV) of $4 million and a 514% return on investment (ROI), with payback often in as little as 3 months.

    Forrester interviewed four customers with experience using DataRobot in the retail, healthcare, and manufacturing sectors as the basis for the report to help them better understand the benefits, costs, and risks associated with using the platform. These customers were looking to overcome key challenges, such as forecasting demand and tackling fraud. 

    Prior to using DataRobot, the customers relied on their data science teams to do the heavy lifting of data preparation, model development and training, and model deployment using traditional open-source technologies, such as the Python and R programming languages and their associated libraries and frameworks.

    Some customers were also hindered by their use of legacy data analysis technologies that failed to keep pace with the advancements in AI and machine learning over the past decade. This created environments with lengthy AI project timelines and frequently missed deadlines where organizations often never deployed and operationalized the models they developed.

    Forrester Consulting designed a composite organization based on the characteristics of the organizations interviewed. They then constructed a financial model representative of the interview findings, using the TEI methodology based on four fundamental elements: benefits, costs, flexibility, and risks. Forrester's TEI methodology serves to provide a complete picture of the total economic impact of purchase decisions.

    The results were remarkable, with firms reporting significant value relative to cost.

    • Cost savings from reduced seasonal overhiring in retail: $500,000
    • Cost savings from reduced healthcare fraud: $10 million
    • Increased revenue from improved demand forecasting in manufacturing: $50 million – $200 million
    • Significant cost savings from avoidance of hiring a data science team 3x as large

    Determining ROI of AI

    Many of our customers ask for help in estimating the value of AI when it’s being used to augment an existing process where some predictive analytics are already in place. The methodology in this report, drawing on data from 4 real-world customer deployments of AI, should provide a useful framework for anyone looking to justify an AI investment.
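    As a purely illustrative sketch of that framework, the core arithmetic is just discounted benefits versus costs; the figures below are made up for the example and are not from the Forrester study.

    # Purely illustrative ROI / NPV arithmetic; all figures are hypothetical.
    costs    = [400_000, 200_000, 200_000]       # year 0..2: platform + implementation costs
    benefits = [600_000, 1_500_000, 2_000_000]   # year 0..2: quantified benefits (e.g. fraud savings)
    rate = 0.10                                  # discount rate

    npv = sum((b - c) / (1 + rate) ** t for t, (b, c) in enumerate(zip(benefits, costs)))
    roi = (sum(benefits) - sum(costs)) / sum(costs)   # simple, undiscounted ROI
    print(f"NPV: {npv:,.0f}   ROI: {roi:.0%}")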

    Today’s business climate has never been so challenging, and organizations need agile trusted solutions that can steer intelligent decision-making in the face of market turbulence and continuously evolving customer needs. 

    DataRobot’s end-to-end enterprise AI platform addresses these demands. By preparing data, automating the training of machine learning models, operationalizing AI, and continuously monitoring AI assets, DataRobot enables organizations to enhance prediction accuracy, accelerate time to insight, reduce risk, and increase revenue – all without requiring heavy investment in data science teams. 

    Read the study to learn more.

    Study
    Forrester Total Economic Impact™ Study of DataRobot

    The post Forrester Total Economic Impact Study of DataRobot: 514% ROI and Payback within 3 Months appeared first on DataRobot.



    from Blog – DataRobot https://ift.tt/2FLCs0k
    via Gabe's Musings

    Saturday, August 22, 2020

    dplyr filter(): Filter/Select Rows based on conditions

    dplyr, the R package at the core of the tidyverse suite of packages, provides a great set of tools to manipulate datasets in tabular form. dplyr has a set of useful functions for "data munging", including select(), mutate(), summarise(), arrange(), and filter().

    And in this tidyverse tutorial, we will learn how to use dplyr's filter() function to select or filter rows from a data frame, with multiple examples. First, we will start with how to select rows of a dataframe based on the value of a single column or variable. And then we will learn how to select rows of a dataframe using values from multiple variables or columns.

    Let us get started by loading tidyverse, the suite of R packages from RStudio.

    library("tidyverse")
    

    We will load Penguins data directly from cmdlinetips.com‘s github page.

    path2data <- "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
    penguins<- readr::read_csv(path2data)
    

    Penguins data look like this

    head(penguins)
    ## # A tibble: 6 x 7
    ##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
    ##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
    ## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
    ## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
    ## 3 Adelie  Torge…           40.3          18                195        3250 fema…
    ## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
    ## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
    ## 6 Adelie  Torge…           39.3          20.6              190        3650 male
    

    Let us subset Penguins data by filtering rows based on one or more conditions.

    How to filter rows based on values of a single column in R?

    Let us learn how to filter a data frame based on the value of a single column. In this example, we want to subset the data such that we select rows whose "sex" column value is "female".

    penguins %>% 
      filter(sex=="female")
    

    This gives us a new dataframe, a tibble, containing rows whose sex column value is "female".

    ## # A tibble: 165 x 7
    ##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
    ##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
    ##  1 Adelie  Torge…           39.5          17.4              186        3800
    ##  2 Adelie  Torge…           40.3          18                195        3250
    ##  3 Adelie  Torge…           36.7          19.3              193        3450
    ##  4 Adelie  Torge…           38.9          17.8              181        3625
    ##  5 Adelie  Torge…           41.1          17.6              182        3200
    ##  6 Adelie  Torge…           36.6          17.8              185        3700
    ##  7 Adelie  Torge…           38.7          19                195        3450
    ##  8 Adelie  Torge…           34.4          18.4              184        3325
    ##  9 Adelie  Biscoe           37.8          18.3              174        3400
    ## 10 Adelie  Biscoe           35.9          19.2              189        3800
    ## # … with 155 more rows, and 1 more variable: sex <chr>
    
    

    In our first example, we used the pipe operator "%>%" to feed the data to dplyr's filter() function and select rows. Like other dplyr functions, we can also use filter() without the pipe operator, as shown below.

    filter(penguins, sex=="female")
    

    And we will get the same results as shown above.

    In the above example, we selected rows of a dataframe by checking a variable's value for equality. We can also use filter() to select rows by checking for inequality, or for values greater than or less than (or equal to) a variable's value.

    Let us see an example of filtering rows when a column’s value is not equal to “something”. In the example below, we filter dataframe whose species column values are not “Adelie”.

    penguins %>% 
      filter(species != "Adelie")
    

    We now get a filtered dataframe with species other than "Adelie".

    ## # A tibble: 192 x 7
    ##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
    ##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
    ##  1 Gentoo  Biscoe           46.1          13.2              211        4500
    ##  2 Gentoo  Biscoe           50            16.3              230        5700
    ##  3 Gentoo  Biscoe           48.7          14.1              210        4450
    ##  4 Gentoo  Biscoe           50            15.2              218        5700
    ##  5 Gentoo  Biscoe           47.6          14.5              215        5400
    ##  6 Gentoo  Biscoe           46.5          13.5              210        4550
    ##  7 Gentoo  Biscoe           45.4          14.6              211        4800
    ##  8 Gentoo  Biscoe           46.7          15.3              219        5200
    ##  9 Gentoo  Biscoe           43.3          13.4              209        4400
    ## 10 Gentoo  Biscoe           46.8          15.4              215        5150
    ## # … with 182 more rows, and 1 more variable: sex <chr>
    

    dplyr filter() with greater than condition

    When the column of interest is a numerical, we can select rows by using greater than condition. Let us see an example of filtering rows when a column’s value is greater than some specific value.

    In the example below, we filter the dataframe to select rows where body mass is greater than 6000, to see the heaviest penguins.

    # filter variable greater than a value
    penguins %>% 
      filter(body_mass_g> 6000)
    

    After filtering for body mass, we get just two rows that satisfy the body mass condition we provided.

    ## # A tibble: 2 x 7
    ##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
    ##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
    ## 1 Gentoo  Biscoe           49.2          15.2              221        6300 male 
    ## 2 Gentoo  Biscoe           59.6          17                230        6050 male
    

    Similarly, we can select or filter rows when a column’s value is less than some specific value.

    dplyr filter() with less than condition

    Similarly, we can also filter rows of a dataframe with less than condition. In this example below, we select rows whose flipper length column is less than 175.

    # filter variable less than a value
    penguins %>% 
      filter(flipper_length_mm <175)
    
    

    Here we get a new tibble with just rows satisfying our condition.

    ## # A tibble: 2 x 7
    ##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
    ##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
    ## 1 Adelie  Biscoe           37.8          18.3              174        3400 fema…
    ## 2 Adelie  Biscoe           37.9          18.6              172        3150 fema…
    

    How to Filter Rows of a dataframe using two conditions?

    With dplyr's filter() function, we can also specify more than one condition. In the example below, we have two conditions inside the filter() function: one specifies flipper length greater than 220, and the second is a condition on the sex column.

    # Boolean AND: both conditions must hold
    penguins %>% 
      filter(flipper_length_mm >220 & sex=="female")
    
    ## # A tibble: 1 x 7
    ##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
    ##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
    ## 1 Gentoo  Biscoe           46.9          14.6              222        4875 fema…
    
    

    dplyr’s filter() function with Boolean OR

    We can filter dataframe for rows satisfying one of the two conditions using Boolean OR. In this example, we select rows whose flipper length value is greater than 220 or bill depth is less than 10.

    penguins %>% 
      filter(flipper_length_mm >220 | bill_depth_mm < 10)
    
    ## # A tibble: 35 x 7
    ##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
    ##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
    ##  1 Gentoo  Biscoe           50            16.3              230        5700
    ##  2 Gentoo  Biscoe           49.2          15.2              221        6300
    ##  3 Gentoo  Biscoe           48.7          15.1              222        5350
    ##  4 Gentoo  Biscoe           47.3          15.3              222        5250
    ##  5 Gentoo  Biscoe           59.6          17                230        6050
    ##  6 Gentoo  Biscoe           49.6          16                225        5700
    ##  7 Gentoo  Biscoe           50.5          15.9              222        5550
    ##  8 Gentoo  Biscoe           50.5          15.9              225        5400
    ##  9 Gentoo  Biscoe           50.1          15                225        5000
    ## 10 Gentoo  Biscoe           50.4          15.3              224        5550
    ## # … with 25 more rows, and 1 more variable: sex <chr>
    

    Select rows with missing value in a column

    Often one might want to filter for, or filter out, rows where one of the columns has missing values. With is.na() on the column of interest, we can select rows where a specific column's value is missing.

    In this example, we select rows or filter rows with bill length column with missing values.

    penguins %>% 
     filter(is.na(bill_length_mm))
    

    In this dataset, there are only two rows with missing values in bill length column.

    ## # A tibble: 2 x 8
    ##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
    ##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
    ## 1 Adelie  Torge…             NA            NA               NA          NA <NA> 
    ## 2 Gentoo  Biscoe             NA            NA               NA          NA <NA> 
    ## # … with 1 more variable: year <int>
    

    We can also use the negation symbol "!" to reverse the selection. In this example, we select rows with no missing values for the sex column.

    penguins %>% 
      filter(!is.na(sex))
    

    Note that this filtering will keep rows that have missing values in other columns.

    ## # A tibble: 333 x 7
    ##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
    ##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
    ##  1 Adelie  Torge…           39.1          18.7              181        3750
    ##  2 Adelie  Torge…           39.5          17.4              186        3800
    ##  3 Adelie  Torge…           40.3          18                195        3250
    ##  4 Adelie  Torge…           36.7          19.3              193        3450
    ##  5 Adelie  Torge…           39.3          20.6              190        3650
    ##  6 Adelie  Torge…           38.9          17.8              181        3625
    ##  7 Adelie  Torge…           39.2          19.6              195        4675
    ##  8 Adelie  Torge…           41.1          17.6              182        3200
    ##  9 Adelie  Torge…           38.6          21.2              191        3800
    ## 10 Adelie  Torge…           34.6          21.1              198        4400
    ## # … with 323 more rows, and 1 more variable: sex <chr>
    

    The post dplyr filter(): Filter/Select Rows based on conditions appeared first on Python and R Tips.



    from Python and R Tips https://ift.tt/3ld5Ht4
    via Gabe's Musings

    Tuesday, August 11, 2020

    dplyr arrange(): Sort/Reorder by One or More Variables

    dplyr, the R package that is part of the tidyverse suite of packages, provides a great set of tools to manipulate datasets in tabular form. dplyr has a set of core functions for "data munging", including select(), mutate(), filter(), summarise(), and arrange().

    And in this tidyverse tutorial, we will learn how to use dplyr's arrange() function to sort a data frame in multiple ways. First, we will start with how to sort a dataframe by the values of a single variable, and then we will learn how to sort a dataframe by more than one variable. By default, dplyr's arrange() sorts in ascending order; we will also learn to sort in descending order.

    Let us get started by loading tidyverse, the suite of R packages from RStudio.

    library("tidyverse")
    

    We will use the fantastic penguins dataset to illustrate how to sort a dataframe. Let us load the data from cmdlinetips.com's github page.

    path2data <- "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
    penguins<- readr::read_csv(path2data)
    
    ## Parsed with column specification:
    ## cols(
    ##   species = col_character(),
    ##   island = col_character(),
    ##   bill_length_mm = col_double(),
    ##   bill_depth_mm = col_double(),
    ##   flipper_length_mm = col_double(),
    ##   body_mass_g = col_double(),
    ##   sex = col_character()
    ## )
    
    head(penguins)
    
    ## # A tibble: 6 x 7
    ##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
    ##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
    ## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
    ## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
    ## 3 Adelie  Torge…           40.3          18                195        3250 fema…
    ## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
    ## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
    ## 6 Adelie  Torge…           39.3          20.6              190        3650 male
    

    How To Sort a Dataframe by a single Variable with dplyr’s arrange()?

    We can use dplyr's arrange() function to sort a dataframe by one or more variables. Let us say we want to sort the penguins dataframe by body mass, to quickly find the lightest penguin and see how it relates to the other variables.

    We will use the pipe operator "%>%" to feed the data to the dplyr function arrange(). We need to specify the name of the variable we want to sort the dataframe by. In this example, we are sorting by the variable "body_mass_g".

    penguins %>% 
      arrange(body_mass_g)
    

    dplyr's arrange() sorts the dataframe by the variable and outputs a new dataframe (as a tibble). You can see that the resulting dataframe is different from the original: the body_mass_g column is now arranged from smallest to largest values.

    ## # A tibble: 344 x 7
    ##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
    ##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
    ##  1 Chinst… Dream            46.9          16.6              192        2700
    ##  2 Adelie  Biscoe           36.5          16.6              181        2850
    ##  3 Adelie  Biscoe           36.4          17.1              184        2850
    ##  4 Adelie  Biscoe           34.5          18.1              187        2900
    ##  5 Adelie  Dream            33.1          16.1              178        2900
    ##  6 Adelie  Torge…           38.6          17                188        2900
    ##  7 Chinst… Dream            43.2          16.6              187        2900
    ##  8 Adelie  Biscoe           37.9          18.6              193        2925
    ##  9 Adelie  Dream            37.5          18.9              179        2975
    ## 10 Adelie  Dream            37            16.9              185        3000
    ## # … with 334 more rows, and 1 more variable: sex <chr>
    

    How To Sort or Reorder Rows in Descending Order with dplyr’s arrange()?

    By default, dplyr's arrange() sorts in ascending order. We can sort by a variable in descending order using the desc() function on the variable we want to sort by. For example, to sort the dataframe by body_mass_g in descending order we use

    penguins %>%
     arrange(desc(body_mass_g))
    
    ## # A tibble: 344 x 7
    ##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
    ##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
    ##  1 Gentoo  Biscoe           49.2          15.2              221        6300
    ##  2 Gentoo  Biscoe           59.6          17                230        6050
    ##  3 Gentoo  Biscoe           51.1          16.3              220        6000
    ##  4 Gentoo  Biscoe           48.8          16.2              222        6000
    ##  5 Gentoo  Biscoe           45.2          16.4              223        5950
    ##  6 Gentoo  Biscoe           49.8          15.9              229        5950
    ##  7 Gentoo  Biscoe           48.4          14.6              213        5850
    ##  8 Gentoo  Biscoe           49.3          15.7              217        5850
    ##  9 Gentoo  Biscoe           55.1          16                230        5850
    ## 10 Gentoo  Biscoe           49.5          16.2              229        5800
    ## # … with 334 more rows, and 1 more variable: sex <chr>
    

    How To Sort a Dataframe by Two Variables?

    With dplyr's arrange() function we can sort by more than one variable. To sort or arrange by two variables, we specify the names of the two variables as arguments to the arrange() function, as shown below. Note that the order matters here.

    penguins %>% 
       arrange(body_mass_g,flipper_length_mm)
    

    In this example, we have body_mass_g first and flipper_length_mm second. dplyr's arrange() sorts by these two variables such that, for each value of the first variable, dplyr under the hood subsets the data and sorts it by the second variable.

    For example, we can see that starting from the second row, body_mass_g has tied values and flipper_length_mm is sorted in ascending order within those ties.

    
    ## # A tibble: 344 x 7
    ##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
    ##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
    ##  1 Chinst… Dream            46.9          16.6              192        2700
    ##  2 Adelie  Biscoe           36.5          16.6              181        2850
    ##  3 Adelie  Biscoe           36.4          17.1              184        2850
    ##  4 Adelie  Dream            33.1          16.1              178        2900
    ##  5 Adelie  Biscoe           34.5          18.1              187        2900
    ##  6 Chinst… Dream            43.2          16.6              187        2900
    ##  7 Adelie  Torge…           38.6          17                188        2900
    ##  8 Adelie  Biscoe           37.9          18.6              193        2925
    ##  9 Adelie  Dream            37.5          18.9              179        2975
    ## 10 Adelie  Dream            37            16.9              185        3000
    ## # … with 334 more rows, and 1 more variable: sex <chr>
    

    Notice the difference in results we get by changing the order of two variables we want to sort by. In the example below we have flipper_length first and body_mass next.

    penguins %>%
      arrange(flipper_length_mm,body_mass_g)
    

    Now our dataframe is first sorted by flipper_length and then by body_mass.

    ## # A tibble: 344 x 7
    ##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
    ##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
    ##  1 Adelie  Biscoe           37.9          18.6              172        3150
    ##  2 Adelie  Biscoe           37.8          18.3              174        3400
    ##  3 Adelie  Torge…           40.2          17                176        3450
    ##  4 Adelie  Dream            33.1          16.1              178        2900
    ##  5 Adelie  Dream            39.5          16.7              178        3250
    ##  6 Chinst… Dream            46.1          18.2              178        3250
    ##  7 Adelie  Dream            37.2          18.1              178        3900
    ##  8 Adelie  Dream            37.5          18.9              179        2975
    ##  9 Adelie  Dream            42.2          18.5              180        3550
    ## 10 Adelie  Biscoe           37.7          18.7              180        3600
    ## # … with 334 more rows, and 1 more variable: sex <chr>
    

    The post dplyr arrange(): Sort/Reorder by One or More Variables appeared first on Python and R Tips.



    from Python and R Tips https://ift.tt/3ivjkSs
    via Gabe's Musings