Showing posts with label Featured Blog Posts - Data Science Central. Show all posts

Tuesday, November 10, 2020

Interesting Application of the Poisson-Binomial Distribution

While the Bernoulli and binomial distributions are among the first ones taught in any elementary statistical course, the Poisson-Binomial is rarely mentioned. It is however one of the simplest discrete multivariate distributions, with applications in survey analysis, see here. In this article, we are dealing with experimental / probabilistic number theory, leading to a more efficient detection of large prime numbers, with applications in cryptography and IT security. 

This article is accessible to people with minimal math or statistical knowledge, as we avoid jargon and theory, favoring simplicity.  Yet we are able to present original research-level results that will be of interest to professional data scientists, mathematicians, and machine learning experts. The data set explored here is the set of numbers, and thus accessible to anyone. We also explain computational techniques, even mentioning online tools, to deal with very large integers that are beyond what standard programming languages or Excel can handle. 

1. The Poisson-Binomial Distribution

We are all familiar with the most basic of all random variables: the Bernoulli.  If Y is such a variable, it is equal to 1 with probability p, and to 0 with probability 1 - p. Here the parameter p is a real number between 0 and 1. If you run n trials, independent from each other, and each with the same probability p of success, then the number of successes, defined as the number of times the outcome is equal to 1, is a Binomial variable of parameters n and p.

Source for picture: here

If the trials are independent but a different p is attached to each of them, then the number of successes has a Poisson-binomial distribution. In short, if we have n independent Bernoulli random variables Y1, ..., Yn with respective parameters p1, ..., pn, then the number of successes X = Y1 + ... + Yn has a Poisson-binomial distribution of parameters p1, ..., pn and n. The exact probability density function is cumbersome to compute as it is combinatorial in nature, but a Poisson approximation is available and will be used in this article, thus the name Poisson-binomial.

The first two moments (expectation and variance) are as follows:
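
For independent Bernoulli trials with P(Yk = 1) = pk, these are the standard expressions, written here in LaTeX:

```latex
\mathrm{E}[X] = \sum_{k=1}^{n} p_k, \qquad \mathrm{Var}[X] = \sum_{k=1}^{n} p_k (1 - p_k)
```

When all the pk are equal to p, they reduce to the familiar binomial values np and np(1 - p).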

The exact formula for the PDF (probability density function) involves an exponentially growing number of terms as n becomes large. For instance, P(X = n - 2) which is the probability that exactly two out of n trials fail, is given by the following formula:
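
Taking pk to be the probability that trial k succeeds, the sum runs over every pair of trials (i, j) that fail while all the others succeed:

```latex
P(X = n-2) = \sum_{1 \le i < j \le n} (1 - p_i)(1 - p_j) \prod_{k \ne i, j} p_k
```

This case has only n(n - 1)/2 terms, but P(X = m) in general requires one term per subset of m successful trials, that is, n! / (m! (n - m)!) terms.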

For this reason, whenever possible, approximations are used. 

1.1. Poisson approximation

When the parameters pk are small, say pk < 0.1, then the following Poisson approximation applies. Let λ = p1 + ... + pn. Then for m = 0, ..., n, we have: 
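
This is the Poisson probability mass function with mean λ:

```latex
P(X = m) \approx e^{-\lambda} \, \frac{\lambda^{m}}{m!}
```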

When n becomes large, we can use the Central Limit Theorem to compute more complicated probabilities such as P(X > m), based on the Poisson approximation. See also the Le Cam theorem for more precise approximations. 
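
To get a feel for the quality of the approximation, the Python sketch below compares it with the exact distribution on a small example. It is only an illustration: the five pk values are made up, and the exact probabilities are built up one trial at a time (a standard recursion that avoids enumerating the combinatorial sum). The last lines print the bound from the Le Cam theorem, which says that the absolute differences between the two distributions sum to less than twice the sum of the squared pk.

```python
import math

def poisson_binomial_pmf(probs):
    """Exact PMF of X = Y1 + ... + Yn, built one trial at a time (O(n^2) work)."""
    pmf = [1.0]  # with zero trials, P(X = 0) = 1
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for m, mass in enumerate(pmf):
            nxt[m] += mass * (1.0 - p)   # trial fails: the count stays at m
            nxt[m + 1] += mass * p       # trial succeeds: the count moves to m + 1
        pmf = nxt
    return pmf

def poisson_pmf(lam, m):
    """Poisson approximation of P(X = m) with mean lam."""
    return math.exp(-lam) * lam ** m / math.factorial(m)

probs = [0.01, 0.02, 0.03, 0.05, 0.08]   # hypothetical small p_k values
exact = poisson_binomial_pmf(probs)
lam = sum(probs)

for m in range(len(exact)):
    print(m, exact[m], poisson_pmf(lam, m))

gap = sum(abs(exact[m] - poisson_pmf(lam, m)) for m in range(len(exact)))
le_cam_bound = 2 * sum(p * p for p in probs)
print("sum of absolute differences:", gap, "Le Cam bound:", le_cam_bound)
```

The smaller the pk, the tighter the bound, which is why the approximation suits the case study below.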

2. Case study: Odds to observe many primes in a random sequence

The 12 integers below were produced with a special sequence described in the second example in this article. It quickly produces a large volume of numbers with no small divisors. How likely is it to produce such a sequence of numbers just by chance? The numbers q[5], q[6], q[7], q[12] have divisors smaller than 1,000 and the remaining eight numbers have no divisor smaller than N = 15,485,863. Note that N (the one-millionth prime) is the largest divisor that I tried in that test.

Here is the answer. The probability for a large number x to be prime is about 1 / log x, by virtue of the Prime Number Theorem. The probability for a large number x to have no divisor smaller than N is
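
Written out in LaTeX, with ρN denoting this probability:

```latex
\rho_N = \prod_{p < N} \left( 1 - \frac{1}{p} \right) \sim \frac{e^{-\gamma}}{\log N}
```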

where the product is over all primes p < N and γ = 0.577215… is the Euler–Mascheroni constant. Here ρN ≈ 0.033913. See here for an explanation of the equality on the left side. The right-hand formula is known as the Mertens theorem. See also here. The symbol ~ represents asymptotic equivalence. Thus the probability to observe 8 large numbers out of 12 having no divisor smaller than N is
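
Treating each of the 12 numbers as an independent trial that succeeds with probability ρN, the event that exactly eight of them have no divisor smaller than N has a binomial probability. The Python sketch below is one way to reproduce the numbers under that assumption: it evaluates the Mertens estimate at N = 15,485,863, checks the estimate against the exact product over primes for a smaller limit (chosen only to keep the sieve fast), and then computes the binomial probability. The function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329

def mertens_estimate(n_limit):
    """Asymptotic estimate e^(-gamma) / log N of the product of (1 - 1/p) over primes p < N."""
    return math.exp(-EULER_GAMMA) / math.log(n_limit)

def prime_product(n_limit):
    """Exact product of (1 - 1/p) over primes p < n_limit (n_limit >= 2), using a simple sieve."""
    sieve = bytearray([1]) * n_limit
    sieve[0:2] = b"\x00\x00"
    for i in range(2, math.isqrt(n_limit - 1) + 1):
        if sieve[i]:
            # cross out the multiples of i, starting at i * i
            sieve[i * i::i] = bytearray(len(range(i * i, n_limit, i)))
    product = 1.0
    for p in range(2, n_limit):
        if sieve[p]:
            product *= 1.0 - 1.0 / p
    return product

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials of success probability p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

rho = mertens_estimate(15_485_863)
print("rho_N (Mertens estimate):", rho)
print("exact product vs estimate at 100,000:", prime_product(100_000), mertens_estimate(100_000))
print("P(exactly 8 of 12 have no divisor < N):", binomial_pmf(8, 12, rho))
```

The resulting probability is below one in a billion, which is what makes the observed sequence remarkable.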

Note that we used a binomial distribution here to answer the question. Also, the probability for x to be prime if it has no divisor smaller than N is equal to
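
A prime x larger than N automatically has no divisor smaller than N, so this conditional probability is the ratio of the two probabilities above. A reconstruction under that reasoning, in LaTeX:

```latex
P(x \text{ prime} \mid x \text{ has no divisor} < N) \approx \frac{1 / \log x}{\rho_N} = \frac{1}{\rho_N \log x} \sim \frac{e^{\gamma} \log N}{\log x}
```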

For the above numbers q[1], ..., q[12], the probability in question is not small. For instance, it is equal to 0.47, 0.36, 0.23 respectively for q[1], q[2], q[11]. Other sequences producing a high density of prime numbers are discussed here and here.

2.1. Computations based on the Poisson-Binomial distribution

Let us denote as pk the probability that q[k] is prime, for k = 1, ..., 12. As discussed earlier in section 2, pk = 1 / log q[k] is small, and the Poisson approximation can be used when dealing with the Poisson-binomial distribution. So we can use the formula in section 1.1 with λ = p1 + ... + pn and n = 12.

Thus, λ = 0.11920 (approx.). Now we can compute P(X = m) for m = 8, 9, 10, 11, 12:

The chance that 8 or more large numbers are prime among q[1],⋯,q[12] is the sum of the 5 probabilities in the above table. It is equal to 9.1068 / 10^13. That is, less than one in a trillion. 
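
These values can be recomputed with a few lines of Python. The sketch below plugs the rounded value λ = 0.11920 into the Poisson formula of section 1.1; because λ is only quoted to five decimals, the total comes out at about 9.1 / 10^13 and need not match the figure above to every digit.

```python
import math

def poisson_pmf(lam, m):
    """Poisson approximation of P(X = m) with mean lam."""
    return math.exp(-lam) * lam ** m / math.factorial(m)

lam = 0.11920   # rounded value of lambda quoted in the text
n = 12

table = {m: poisson_pmf(lam, m) for m in range(8, n + 1)}
tail = sum(table.values())

for m, prob in table.items():
    print(f"P(X = {m}) ~ {prob:.4e}")
print(f"P(X >= 8) ~ {tail:.4e}")
```

Almost all of the total comes from the m = 8 term, since each extra prime multiplies the probability by roughly λ / m.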

2.2. Technical note: handling very large numbers

Numbers investigated in this research have dozens and even hundreds of digits. The author has routinely worked with numbers with millions of digits. Below are some useful tools to deal with such large numbers.

  • If you use a programming language, check if it has a BigNum or BigInt library. Here I used the Perl programming language, with the BigNum library. A similar library is available in Python. See examples of code, here
  • A list of all prime numbers up to one trillion is available here
  • To check if a large number p is prime or not, use the command PrimeQ[p] in Mathematica, also available online here. Another online tool, allowing you to test many numbers in batch to find which ones are prime, is available here.
  • The online Sagemath symbolic calculator is also useful. I used it e.g. to compute millions of binary digits of numbers such as SQRT(2), see here
  • For those interested in experimental number theory, the OEIS online tool is also very valuable. If you discover a sequence of integers, and you are wondering if it has been discovered before, you can do a reverse lookup to find references to the sequence in question. You can also do a reverse lookup on math constants, entering the first 15 digits to see if it matches a known math constant.
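
In Python specifically, the built-in int type already has arbitrary precision, so exact work on integers with hundreds or millions of digits needs no extra library. The sketch below is only an illustration of that point: it builds a 157-digit integer and extracts binary digits of SQRT(2) using exact integer square roots. The helper name sqrt2_binary_digits is made up for this example.

```python
import math

def sqrt2_binary_digits(n_bits):
    """Return floor(sqrt(2) * 2**n_bits) as a binary string, using exact integer arithmetic."""
    scaled = math.isqrt(2 << (2 * n_bits))   # integer square root of 2 * 4**n_bits
    return bin(scaled)[2:]

big = 2 ** 521 - 1                  # an integer with 157 decimal digits
print(len(str(big)), "digits")
print(big % 1_000_003)              # ordinary operators work on big integers

# leading 1 (the integer part of sqrt(2)) followed by the first 64 fractional bits
print(sqrt2_binary_digits(64))
```

The same function works unchanged for thousands or millions of bits; only the running time grows.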

3. Cryptography application 

Many cryptography systems rely on public and private keys that feature the product of two large primes, typically with hundreds or thousands of binary digits. Producing such large primes was not an easy task until efficient algorithms were created to check if a number is prime or not. These algorithms are known as primality tests. Some are very fast but only provide a probabilistic answer: the probability that the number in question is a prime number, which is either zero or extremely close to one. These algorithms rely on sampling a large number of primes to identify prime candidates, and then determine their status (prime or not prime) with an exact but more costly test. 
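
As an illustration of a fast probabilistic test, here is a minimal Python sketch of the Miller-Rabin test. The article does not say which test it has in mind, so this choice is an assumption, as are the function name and the default of 20 rounds. A composite number survives one round with probability at most 1/4, so 20 rounds leave an error probability of at most 4^-20, while a prime always passes.

```python
import random

def is_probable_prime(n, rounds=20, rng=random):
    """Miller-Rabin test: False means certainly composite, True means almost certainly prime."""
    if n < 2:
        return False
    small_primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for p in small_primes:
        if n % p == 0:
            return n == p
    # write n - 1 = d * 2**s with d odd
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)      # random base
        x = pow(a, d, n)                 # fast modular exponentiation on big integers
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                 # the base a is a witness: n is composite
    return True                          # no witness found in any round

print(is_probable_prime(2 ** 521 - 1))                   # a known Mersenne prime
print(is_probable_prime((2 ** 61 - 1) * (2 ** 89 - 1)))  # product of two primes
```

Candidates that pass can then be handed to an exact, more costly test.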

Remember that the probability for a random, large integer p to be prime is about 1 / log p. So if you test 100,000 numbers close to 10^300, you'd expect to find 145 primes. Not a very efficient strategy. One way to improve these odds by an order of magnitude is to pick integers belonging to sequences that are prime-rich: such sequences can contain 10 times more primes than random sequences. This is where the methodology discussed here becomes handy. Such sequences are discussed in two of my articles: here and here.

About the author:  Vincent Granville is a data science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent also founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target). You can access Vincent's articles and books, here.

from Featured Blog Posts - Data Science Central
via Gabe's Musings

(With Images) Know everything about GANs (Generative Adversarial Network) in depth

Let’s understand the GAN (Generative Adversarial Network).

Generative Adversarial Networks were invented in 2014 by Ian Goodfellow (author of the best deep learning book on the market) and his fellow researchers. The main idea behind GANs is to use two networks competing against each other to generate new, unseen data (don’t worry, you will understand this further on). GANs are often described as a counterfeiter versus a detective. Let’s get an intuition of how exactly they work.

So we can think of counterfeiter as a generator. 

The generator is going to:

  • Receive random noise, typically drawn from a Gaussian (normal) distribution.
  • Attempt to output data, most often image data.

The discriminator:

  • Takes a data set consisting of real images from the real dataset and fake images from the generator.
  • Attempts to classify real vs fake images.

Keep in mind that, regardless of your source of images (even if it’s MNIST with 10 classes), the discriminator itself performs binary classification. It just tries to tell whether an image is real or fake.

So let’s actually see the process:

We first start with some noise, such as Gaussian noise, and feed it directly into the generator. The goal of the generator is to create images that fool the discriminator.

In the very first stage of training, the generator is just going to produce noise.

And then we also grab images from our real dataset.

And then in PHASE 1, we train the discriminator, essentially labeling fake generated images as zeros and real images from the dataset as ones. So basically zero if you are fake and one if you are real.

We feed that into the discriminator, and the discriminator gets trained to tell the real images from the fake images. Then, as time goes on, the generator during the second PHASE of training keeps improving its images and trying to fool the discriminator, until it is hopefully able to generate images that mimic the real dataset and the discriminator is no longer able to tell the difference between the fake image and the real image.

So from the above example, we see that there are really two training phases:

  • Phase 1 - Train Discriminator
  • Phase 2 - Train Generator

In phase one, what we do is we take the real images and we label them as one and they are combined with fake images from a generator labeled as zero. The discriminator then trains to distinguish the real images from fake images. Keep in mind that in phase one of training the backpropagation is only occurring on the discriminator. So we are only optimizing the discriminator’s weights during phase one of training.

Then in phase two, we have the generator produce more fake images, and we feed only these fake images to the discriminator with all the labels set as real. This causes the generator to attempt to produce images that the discriminator believes to be real. What’s important to note here is that in phase two, because we are feeding in all fake images labeled as 1, we only perform backpropagation on the generator weights in this step. So we are not going to be able to make a typical fit call on all the training data as we did before. Since we are dealing with two different models (a discriminator model and a generator model), we will also have two different phases of training.

What is really interesting here, and something you should always keep in mind, is that the generator itself never actually sees the real images. It generates convincing images based only on gradients flowing back through the discriminator during its phase of training. Also, keep in mind that the discriminator improves as the training phases continue, meaning the generated images will need to get better and better in order to fool the discriminator.
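
To make the two phases concrete, here is a deliberately tiny Python sketch, not a realistic image GAN. Everything in it is an assumption made for illustration: the "images" are single numbers drawn from a normal distribution centered at 4, the generator is the linear map a*z + b, the discriminator is a logistic unit sigmoid(w*x + c), and the gradients are written out by hand instead of using a deep learning framework. Note how phase 1 returns only new discriminator weights, phase 2 returns only new generator weights, and the generator never touches the real data.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def generate(g, z):
    a, b = g
    return a * z + b                      # toy generator: noise in, "image" out

def discriminate(d, x):
    w, c = d
    return sigmoid(w * x + c)             # probability that x is real

def train_discriminator(d, g, real_batch, noise_batch, lr):
    """Phase 1: real samples labeled 1, generated samples labeled 0; only d is updated."""
    w, c = d
    labeled = [(x, 1.0) for x in real_batch]
    labeled += [(generate(g, z), 0.0) for z in noise_batch]
    grad_w = grad_c = 0.0
    for x, label in labeled:
        err = discriminate(d, x) - label  # cross-entropy gradient w.r.t. the logit
        grad_w += err * x
        grad_c += err
    n = len(labeled)
    return (w - lr * grad_w / n, c - lr * grad_c / n)

def train_generator(d, g, noise_batch, lr):
    """Phase 2: generated samples labeled 1; gradients flow through d, only g is updated."""
    w, _ = d
    a, b = g
    grad_a = grad_b = 0.0
    for z in noise_batch:
        err = discriminate(d, generate(g, z)) - 1.0   # the label is "real" (1)
        grad_a += err * w * z             # chain rule through x = a*z + b
        grad_b += err * w
    n = len(noise_batch)
    return (a - lr * grad_a / n, b - lr * grad_b / n)

def train(steps=200, batch=32, lr=0.05, seed=0):
    rng = random.Random(seed)
    d, g = (0.0, 0.0), (1.0, 0.0)
    for _ in range(steps):
        real = [rng.gauss(4.0, 1.0) for _ in range(batch)]   # hypothetical real data
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        d = train_discriminator(d, g, real, noise, lr)
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        g = train_generator(d, g, noise, lr)
    return d, g

d, g = train()
print("discriminator (w, c):", d)
print("generator (a, b):", g)
```

In a real GAN the same alternation is usually written with a framework, freezing the discriminator weights during phase 2.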

This can lead to pretty impressive results. Researchers have published many models, such as StyleGAN and face GANs, that produce fake human images that are extremely detailed. See below the example of face GAN performance from NVIDIA. Impressive, right?

Now let’s talk about the difficulties with GANs:

  • Training Resources

Since GANs are most often used with image-based data, and since we have two networks competing against each other, they require GPUs for a reasonable training time. Fortunately, we have Google Colab, which lets us use GPUs for free.

  • Mode Collapses 

Often what happens is that the generator figures out just a few images, or sometimes even a single image, that can fool the discriminator, and eventually “collapses” to only produce that image. So you can imagine that, back when it was producing faces, maybe it figured out how to produce one single face that fools the discriminator. Then the generator ends up just learning to produce the same face over and over again.

So in theory it would be preferable to have a variety of images, such as multiple numbers or multiple faces, but GANs can quickly collapse to produce a single number or face, whatever the dataset happens to be, regardless of the input noise.

This means you can feed in any type of random noise you want, but the generator has figured out the one image that it can use to fool the discriminator.

There are a couple of different ways to overcome this problem. One is to use a DCGAN (deep convolutional GAN, which I will explain in another blog). DCGANs are typically better at avoiding mode collapse because they are more complex and have deeper layers.

Researchers have also experimented with what’s known as “mini-batch discrimination”, essentially punishing generated batches that are all too similar. So if the generator starts to suffer mode collapse and produces batches of very similar-looking images, the discriminator will begin to punish that particular batch for having images that are all too similar.

  • Instability

It can be difficult to ascertain performance and the appropriate number of training epochs, since all the generated images at the end of the day are truly fake. So it’s difficult to tell how well our model is performing at generating images: just because the discriminator thinks something is real doesn’t mean that a human like us will think that a face or a number looks real enough.

And again due to the design of a GAN, the generator and discriminator are constantly at odds with each other which leads to performance oscillation between the two.

So when dealing with GANs you have to experiment with hyperparameters such as the number of layers, the number of neurons, the activation function, learning rates, etc., especially when it comes to complex images.


  • GANs are a very popular area of research! Often the results are so fascinating and so cool that researchers even like to do this for fun, so you will see a ton of different reports on all sorts of GANs.
  • So I would highly encourage you to make a quick search on Google Scholar for the latest research papers on GANs. Trust me, you will see a paper on this topic every month.
  • I highly recommend that you play with GANs and have fun making different things to show off on social media.


Monday, November 9, 2020

Three steps for conquering the last mile of analytics

Becoming insights-driven is now the ultimate prize of digital transformation, and many organizations are making significant progress toward this goal. However, putting insights into action – the “last mile” of analytics – is still a challenge for many organizations. 

With continued investments in data, analytics and AI, as well as the broader availability of machine-learning tools and applications, organizations have an abundance of analytical assets. Yet the creation of analytical assets should not be the only measure of success for organizations. In reality, deploying, operationalizing, or putting analytical assets into production should be the driver for how organizations are able to get value from their AI and data science efforts.

In a traditional data and analytics continuum, data is transformed into insights to support decision-making. If organizations want to break out from experimentation mode, avoid analytics assets becoming shelfware, and empower front lines to make analytics-powered decisions, they must start with decisions. Then they need to decide how to find, integrate and deliver the insights; and identify data to enable that.

These days, many organizations would argue they’re doing just that – they’ve hired analytics talent and appointed chief data officers (CDOs) or chief analytics officers (CAOs) to collaborate with business leaders to become more data- and analytics-driven. But many organizations are not seeing the desired impact and value from their data and analytics initiatives and are not able to quickly put their pilot projects into production.

According to IDC, only 35% of organizations indicate that analytical models are fully deployed in production. Difficulty in deploying and operationalizing analytics into systems or applications – and being consumed by downstream processes or people – is a key barrier to achieving business value.

Some might argue that the main focus within analytics projects has been on developing analytical recipes (e.g., data engineering, building models, merits of individual algorithms, etc.), while not much attention, priority or investment is given to the operationalization of these assets. This is easier said than fixed. Data does not provide differentiation; decisions at scale do. Applying insights consistently to turn data into decisions will let organizations build a true software-led system of insights to grow and break away from competitors.

Check out the full article to learn how organizations can put analytics into action in a systematic, scalable manner and conquer the last mile.


Public Cloud: The Indispensable Component for Businesses

Every public Cloud has some essential features, such as a scalable architecture, a pay-per-consumption model, instant reach to the masses and diversified component options. The migration-driven journey often depicts a standard architectural structure. Cloud can transform a basic prototype into reality, automate key processes and enhance customer experiences through real-time engagements containing in-depth insights. All these are only possible when the desired 'outcomes' have been seen as a key deliverable of the customer transformation journey.

Digital Transformation (DX), when implemented correctly, can drive the enhanced customer experience by elevating it from static platforms to dynamic user demands, with room for innovations. Today, various leading technology companies serve as the Digital Transformation Enabler for other organizations by offering a unique and customized set of services and offerings. All the technology giant organizations have expertise in data science and Big Data, which is used to extract quality information to enhance customer experience levels. These DX-enabler companies have altered the existing user ecosystem that supports the underlying value chain.

The opportunity for building a cloud-based architecture helps in facilitating not just basic technology. It also supports a broader range of assessments for users, processes and interactions that make them different by drawing benefits across the entire service lifecycle. Teams across a value chain can 'interact' with various Cloud services at different service lifecycle stages.

Effective Cloud Utilization

The effective utilization of Cloud technology requires that the various supporting technologies align with automation and efficiency right from the core. Examples of poor alignment include human-intervened approvals and manually defined virtual machine budgets in an otherwise automated landscape.

This new arena of platform-based components has already started offering infinite scope since most of the leading Cloud service providers now function on multiple individual cloud parts. Thus, it is now becoming inevitably important that these functionalities are effectively planned for the desired output. Most of the older Cloud-based platforms are now headed towards older data centers on a physical level.

Business Benefits of Public Cloud

With public Cloud hosting, several tenants share the physical hardware. This makes it easier for the tenants to divide the infrastructure costs amongst various users. Most of the public Cloud providers offer their hosting on a pay-per-consumption basis, allowing the users to pay only for the consumed resources such as RAM, memory space, etc. With services being used on a consumption basis, it becomes easier for SMBs to leverage public Cloud services.

Other value-added benefits that businesses get by leveraging public Cloud hosting include:

  • Cost-Effective Solution

Public Cloud offers an extremely flexible billing structure. Most of the service providers are now offering Cloud hosting services on a pay-per-consumption basis, where the user is charged only for the utilized resources. The pay-per-consumption billing model helps in reducing additional overheads, and such models can be extremely beneficial for small businesses that keep a tight rein on their costs.

  • Quick Setup

Public Cloud can be set up very quickly. It can be accessed over the Internet and even deployed and configured in remote locations. The Cloud provider's IT team can easily configure the Public Cloud remotely for the client using an active Internet connection.

  • No Maintenance Needed

By availing the services of a Public Cloud provider, one doesn't need to worry about the maintenance of software, hardware and networks. Public Cloud providers also eliminate the need to maintain the infrastructure and concerns related to security and upgrades. Public Cloud hosting also allows businesses to manage the infrastructure with bare minimum staff and reduce overall operating costs.

  • Enhanced Agility

Today, every business needs to be quick and dynamic in order to remain productive. To stay productive, businesses must constantly evolve and enhance their existing processes, tools and techniques. By being agile, companies can make faster decisions, resulting in improved customer satisfaction levels. Through the public Cloud, businesses get simpler operations, better delivery and faster launch of new business programs and services.

  • Maximum Uptime

Most of the public Cloud hosting providers promise 99% uptime, with little chance of failure. Because the entire Cloud system is spread across multiple servers, another server can take over when one server fails. This results in a smooth and continuous performance for major business-critical applications.


Concluding Words

Cloud has been in the technological space for quite some time now. It offers numerous benefits to businesses, ensuring smoother business operations and increased focus on critical operational areas. Public Cloud hosting in particular is shared by multiple tenants present on the server, which allows them to split the hosting costs.


Sunday, November 8, 2020

AI and Employee experience| How does it positively go together?

Did you know that 62% of the workforce believes that AI will have a favourable impact on their jobs?

And 67% say it is necessary to develop the skills for working in coordination with intelligent machines.

According to research studies, AI will be the driving force behind cultural and economic shifts that will make the workforce more productive and agile. AI plays a major role in helping the organisation actively listen to and understand the perspective of employees. It allows the company to determine what exactly an employee wants and provide suitable information, such as career path recommendations and coaching. With AI algorithms, businesses can now formulate HR strategies that adapt to the particular needs of employees rather than being limited by the resource planning of the HR team.

Today’s workforce not only looks forward to having a better workplace, but also expects a better brand story that adds credibility to their future career. Working with a reputed brand in the market adds to the employee experience. With AI tools, the company can give its employees a good first impression, a personalised upskilling and training experience, and an exceptional employee involvement process. These factors improve opinions of the company, which leads to an overall improvement of the company brand.

Moreover, AI can surface minute details of HR processes that deserve the attention of HR managers. This way, productivity gets improved wherever possible. Not only this, AI plays a predominant role in many other HR processes.

Let's find out how AI can prove to be beneficial for the organisational team:

  • AI offers valuable insights on better interpersonal relationships among team members
  • AI is diligently skilful in handling mental health issues at the workplace
  • AI acts as a good interface in the recruitment process
  • AI improves the onboarding experience of an employee from the first day itself
  • AI promotes task automation
  • AI cultivates training paradigms suited to each employee's learning prowess

AI also raises a red flag to signal the HR manager whenever a team member faces difficulty in learning. This, in turn, enables the HR manager to provide training exercises for improving those skills.

To sum up, AI enables the organisation to comply with HR processes so as to promote the smooth flow of productivity in a systematic manner and also improve the employee experience throughout their journey in the organisation.

Botomation with Tryvium desk empowers the organization to reinforce the AI dream of self-service for employee experience improvement. It gives large organizations a platform to collaborate and become more interactive by adding value to the MS Teams and Skype for Business platforms, with a focus on enhancing the employee experience. It also integrates with major ITSM tools, customer support systems and CRMs available in the market. With the Tryvium desk, organizations can enhance first call resolution, improve agent productivity and reduce call handling time, which makes the experience of employees and their customers more powerful. Botomation is one of our expertly developed applications combining AI and intelligent automation technologies to help companies with digital transformation. Whether it is a customer support or an employee request, Botomation can always provide a near-human experience with superior communication.

Many organizations have introduced Botomation in their systems to optimize AI in a significant manner that interprets the actions relating to business processes and communicates smoothly within the organization. This, in turn, improves operational efficiency and brings down overhead costs by taking over the routine tasks of employees. It allows the employees to concentrate on higher-value processes, which improves productivity by 86%. Botomation also upgrades the HR processes and extends the benefits to employees in the areas of performance review, monthly payroll and travel expense management.

To know more, download the ebook

Please feel free to contact our sales team at +1 732-283-0499. We will be more than happy to assist you.


How do I select SVM kernels?

This article was written by Sebastian Raschka.

Given an arbitrary dataset, you typically don't know which kernel may work best. I recommend starting with the simplest hypothesis space first -- given that you don't know much about your data -- and working your way up towards the more complex hypothesis spaces. So, the linear kernel works fine if your dataset is linearly separable; however, if your dataset isn't linearly separable, a linear kernel isn't going to cut it (almost in a literal sense).
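
The "simplest first" advice can be tried on a four-point toy problem. The Python sketch below is only an illustration and makes two substitutions that should be kept in mind: to stay dependency-free it trains a kernel perceptron rather than an SVM, and it uses the XOR pattern rather than a real dataset. The point it shows carries over, though: with a linear kernel no decision function gets all four points right, while an RBF kernel (gamma = 2 is an arbitrary choice) separates them.

```python
import math

def linear_kernel(x, z):
    # dot product; the "+ 1" plays the role of a bias term
    return sum(a * b for a, b in zip(x, z)) + 1.0

def rbf_kernel(x, z, gamma=2.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def kernel_perceptron(points, labels, kernel, max_epochs=100):
    """Train a kernel perceptron and return its decision function."""
    alphas = [0] * len(points)    # how many times each point triggered an update

    def score(x):
        return sum(al * y * kernel(p, x) for al, y, p in zip(alphas, labels, points))

    for _ in range(max_epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(points, labels)):
            if y * score(x) <= 0:     # misclassified, or exactly on the boundary
                alphas[i] += 1
                mistakes += 1
        if mistakes == 0:             # every training point is on the right side
            break
    return score

def accuracy(score, points, labels):
    return sum(y * score(x) > 0 for x, y in zip(points, labels)) / len(points)

# XOR pattern: not linearly separable
points = [(0, 0), (1, 1), (0, 1), (1, 0)]
labels = [-1, -1, 1, 1]

for name, kernel in (("linear", linear_kernel), ("rbf", rbf_kernel)):
    model = kernel_perceptron(points, labels, kernel)
    print(name, accuracy(model, points, labels))
```

On real data the same comparison would be made with cross-validated accuracy, starting from the linear kernel and moving to more flexible ones only if it falls short.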

For simplicity (and visualization purposes), let's assume our dataset consists of 2 dimensions only. Below, I plotted the decision regions of a linear SVM on 2 features of the iris dataset.

To read the rest of the article, click here.



Multi-stage heterogeneous ensemble meta-learning with hands-off demo

In this blog, I will introduce an R package for Heterogeneous Ensemble Learning (Classification, Regression) that is fully automated. It significantly lowers the barrier for the practitioners to apply heterogeneous ensemble learning techniques in an amateur fashion to their everyday predictive problems.

Before we delve into the package details, let’s start with understanding a few basic concepts.

Why Ensemble Learning?

Generally, predictions become unreliable when the input sample is outside the training distribution, biased with respect to the data distribution, prone to noise, and so on. Most approaches require changes to the network architecture, fine tuning, balanced data, increasing model size, etc. Further, the selection of the algorithm plays a vital role, while the scalability and learning ability decrease with complex datasets. Combining multiple learners is an effective approach, and has been applied to many real-world problems. Ensemble learners combine a diverse collection of predictions from the individual base models to produce a composite predictive model that is more accurate and robust than its components. With meta ensemble learning one can minimize generalization error to some extent irrespective of the data distribution, number of classes, choice of algorithm, number of models, complexity of the datasets, etc. So, in summary, the predictive models will be able to generalize better.

How can we build models in a more stable fashion while minimizing under-fitting/overfitting, which is very critical to the overall outcome? The solution is ensemble meta-learning of a heterogeneous collection of base learners.

Common Ensemble Learning Techniques

The different popular ensemble techniques are referred to in the figure below. Stacked generalization is a general method of using a high-level model to combine lower-level models to achieve greater predictive accuracy. In the Bagging method, the independent base models are derived from the bootstrap samples of the original dataset. The Boosting method grows an ensemble in a dependent fashion iteratively, which adjusts the weight of an observation based on the past prediction. There are several extensions of bagging and boosting.
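
As a concrete picture of the bagging idea, here is a small Python sketch. It is a generic illustration and not how metaEnsembleR is implemented: the base learner is a one-dimensional decision stump, the data set is a made-up toy, and all names are hypothetical. Each base model is fitted on a bootstrap sample (drawn with replacement from the original data) and the ensemble predicts by majority vote.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a 1-D decision stump: pick the threshold and orientation with the fewest errors."""
    xs = sorted({x for x, _ in points})
    # candidate thresholds: one below the minimum, then midpoints between distinct values
    candidates = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for t in candidates:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in points)
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def bagging(points, n_models, rng):
    """Train n_models stumps on bootstrap samples and return a majority-vote predictor."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]   # bootstrap: sample with replacement
        models.append(fit_stump(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict

# toy data: class 0 for x < 5, class 1 otherwise
points = [(x, 0 if x < 5 else 1) for x in range(10)]
predict = bagging(points, n_models=25, rng=random.Random(0))
print([predict(x) for x in range(10)])
```

Stacking differs in that the base models' predictions are fed to a final learner instead of being combined by a vote.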



metaEnsembleR is an R package for automated meta-learning (Classification, Regression). The functionalities provided include simple user-input-based predictive modeling with a choice of algorithms, train-validation-test split, model evaluations, and easy guided prediction on unseen data, which can help users build stack ensembles on the go. The core aim of this package is to cater to a larger audience in general. metaEnsembleR significantly lowers the barrier for the practitioners to apply heterogeneous ensemble learning techniques in an amateur fashion to their everyday predictive problems.

Using metaEnsembleR

The package consists of the following components:

  • Ensemble Classifiers Training and Prediction

All these functions are intuitive, and their use is illustrated with the examples below, which cover classification and regression problems.

Getting Started

The package can be installed directly from CRAN.

Install from the R console: install.packages("metaEnsembleR")

However, the latest stable version (if any) can be found on GitHub and installed using the devtools package.

Install from GitHub:

if(!require(devtools)) install.packages("devtools")

devtools::install_github(repo = 'ajayarunachalam/metaEnsembleR', ref = 'main')




  • Training the ensemble classification model is as simple as a one-line call to the ensembler.classifier function, passing either a CSV file directly or an imported data frame. The arguments are, in order: the dataset, the outcome/response variable index, the base learners, the final learner, the train-validation-test split ratio, and the unseen data.

library(metaEnsembleR)

ensembler_return <- ensembler.classifier(iris[1:130,], 5, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, read.csv('./unseen_data.csv'))

unseen_new_data_testing <- iris[130:150,]

ensembler_return <- ensembler.classifier(iris[1:130,], 5, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, unseen_new_data_testing)

The above function returns the following, i.e., test data with the predictions, prediction labels, model result, and finally the predictions on unseen data.

testpreddata <- data.frame(ensembler_return[1])

#### Performance comparison #####

modelresult <- ensembler_return[3]

#### Unseen data ###

unseenpreddata <- data.frame(ensembler_return[4])
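The three ratios passed in the calls above (0.60, 0.20, 0.20) describe a train-validation-test split. How metaEnsembleR performs the split internally is not stated here, so the following is only a plain-Python sketch of the general idea, with a hypothetical function name and an assumed random shuffle: the rows are shuffled once and cut into three disjoint partitions.

```python
import random

def three_way_split(rows, train=0.60, validation=0.20, test=0.20, seed=42):
    """Shuffle rows and cut them into train / validation / test partitions.

    The test partition receives whatever remains after the first two cuts,
    so no row is lost to rounding.
    """
    if abs(train + validation + test - 1.0) > 1e-9:
        raise ValueError("split ratios must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# 130 row indices, matching the size of iris[1:130,] in the R example
train_rows, val_rows, test_rows = three_way_split(range(130))
print(len(train_rows), len(val_rows), len(test_rows))
```

In a stacked ensemble, the base learners would typically be fitted on the training partition, the final learner on the base learners' predictions for the validation partition, and the test partition kept aside for the performance comparison.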


  • Training the ensemble regression model is likewise a one-line call, this time to the ensembler.regression function, passing either a CSV file directly or an imported data frame. The arguments are, in order: the dataset, the outcome/response variable index, the base learners, the final learner, the train-validation-test split ratio, and the unseen data.

house_price <- read.csv(file = './data/regression/house_price_data.csv')

unseen_new_data_testing_house_price <- house_price[250:414,]

write.csv(unseen_new_data_testing_house_price, 'unseen_house_price_regression.csv', fileEncoding = 'UTF-8', row.names = FALSE)

ensembler_return <- ensembler.regression(house_price[1:250,], 1, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, read.csv('./unseen_house_price_regression.csv'))

ensembler_return <- ensembler.regression(house_price[1:250,], 1, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, unseen_new_data_testing_house_price)

The above function returns the following, i.e., test data with the predictions, prediction values, model result, and finally the unseen data with the predictions.

testpreddata <- data.frame(ensembler_return[1])

####  Performance comparison  #####

modelresult <- ensembler_return[3]

write.csv(modelresult[[1]], "performance_chart.csv")

#### Unseen data  ###

unseenpreddata <- data.frame(ensembler_return[4])


demo classification




library(ggplot2)    # provides qplot and ggsave
library(gridExtra)  # provides tableGrob and grid.arrange

unseen_new_data_testing <- iris[130:150,]

write.csv(unseen_new_data_testing, 'unseen_check.csv', fileEncoding = 'UTF-8', row.names = FALSE)

ensembler_return <- ensembler.classifier(iris[1:130,], 5, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, unseen_new_data_testing)

testpreddata <- data.frame(ensembler_return[1])

#### Performance comparison #####

modelresult <- ensembler_return[3]

act_mybar <- qplot(testpreddata$actual_label, geom = 'bar')

pred_mybar <- qplot(testpreddata$predictions, geom = 'bar')

act_tbl <- tableGrob(t(summary(testpreddata$actual_label)))

pred_tbl <- tableGrob(t(summary(testpreddata$predictions)))

ggsave("testdata_actual_vs_predicted_chart.pdf", grid.arrange(act_tbl, pred_tbl))

ggsave("testdata_actual_vs_predicted_plot.pdf", grid.arrange(act_mybar, pred_mybar))

#### Unseen data ###

unseenpreddata <- data.frame(ensembler_return[4])



demo regression



unseen_rock_data <- rock[30:48,]

ensembler_return <- ensembler.regression(rock[1:30,], 4, c('lm'), 'rf', 0.40, 0.30, 0.30, unseen_rock_data)

testpreddata <- data.frame(ensembler_return[1])

#### Performance comparison #####

modelresult <- ensembler_return[3]

write.csv(modelresult[[1]], "performance_chart.csv")

#### Unseen data ###

unseenpreddata <- data.frame(ensembler_return[4])

Comprehensive Examples

More demo examples can be found in the Demo.R file. To see the results, run Rscript Demo.R in the terminal.


If there is some implementation you would like to see here or add in some examples feel free to do so. You can always reach me at

Always Keep Learning & Sharing Knowledge!!!

from Featured Blog Posts - Data Science Central
via Gabe's Musings

Recent Java enhancements for numeric calculations

In the past, slow evaluation of mathematical functions and a large memory footprint were the most significant drawbacks of Java compared to C++/C for numeric computations and scientific data analysis. However, recent enhancements in the Java Virtual Machine (JVM), in particular in the evaluation of trigonometric functions, have enabled faster and better numerical computing.

In this article we will use DataMelt for our benchmarks. Let us consider the following algorithm, implemented in the dynamically-typed Groovy language and shown below. It uses a large loop that repeatedly calls the sin() and cos() functions. Save these lines in a file with the extension "groovy" and run it in DataMelt:

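import java.lang.Math
long then = System.nanoTime()
double x = 0
// The loop body was lost in the source; this sin()/cos() accumulation is an assumed stand-in
for (int i = 0; i < 1e8; i++)
    x = x + Math.sin(i) / Math.cos(i)
itime = ((System.nanoTime() - then) / 1e9)
println "Time: " + itime + " (sec) result=" + x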

The execution of this Groovy code is typically 10-20% faster than that of the equivalent code implemented in Python:

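import math, time
then = time.time()
x = 0
# The loop body was lost in the source; this sin()/cos() accumulation is an assumed stand-in
for i in xrange(int(1e8)):  # use range() instead of xrange() on Python 3
    x = x + math.sin(i) / math.cos(i)
itime = time.time() - then
print("Time:", itime, " (sec) result=", x)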

Note that CPython2 (version 2.7.2) is about 20% faster than CPython3 (version 3.4.2), but both CPython interpreters are slower than Groovy for this example.

The same algorithm re-implemented in Java:

import java.lang.Math;
public class Example {
    public static void main(String[] args) {
        long then = System.nanoTime();
        double x = 0;
        // The loop body was lost in the source; this sin()/cos() accumulation is an assumed stand-in
        for (int i = 0; i < (int) 1e8; i++)
            x = x + Math.sin(i) / Math.cos(i);
        double itime = ((System.nanoTime() - then) / 1e9);
        System.out.println("Time for calculations (sec): " + itime + "\n");
        System.out.println("Pi = " + x + "\n");
    }
}

and processed using DataMelt with OpenJDK13 further increases the execution speed by a factor of 2 compared to the Groovy dynamic language.

Similar benchmarks of the Java code were carried out by repeating the calculation using Java SE 8 ("JDK1.8"), released in March 2014. The computation was about a factor of 8 slower than with OpenJDK13. This was due to less optimized code for the evaluation of trigonometric functions in JDK1.8 (and earlier versions). This confirms significant improvements for numeric computations in the recent JVM compared to previous releases.
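Single wall-clock measurements such as these are noisy, so a common hedge is to repeat the run several times and report the shortest time. The sketch below shows that pattern in Python. The workload function, the helper names, and the reduced loop count of 10**5 are assumptions chosen so the example finishes quickly; it is not the benchmark code used for the numbers quoted in this article.

```python
import math
import time

def trig_workload(n):
    """Illustrative workload: n iterations, each calling sin() and cos()."""
    x = 0.0
    for i in range(n):
        x += math.sin(i) * math.cos(i)
    return x

def best_time(func, arg, repeats=5):
    """Run func(arg) several times; return (shortest wall-clock seconds, last result)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(arg)
        timings.append(time.perf_counter() - start)
    return min(timings), result

seconds, value = best_time(trig_workload, 10**5)
print("Time: %.4f (sec) result=%r" % (seconds, value))
```

Taking the minimum filters out runs that were slowed down by other activity on the machine. On the JVM, a few warm-up iterations before timing serve a similar purpose, because they give the just-in-time compiler a chance to optimize the loop.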

Code profiling across different implementations is a complex problem, and we do not plan to explore all possible scenarios in this article. The main conclusion we want to draw in this section is that the processing speed of code that implements mathematical functions for numeric calculations is substantially better for Groovy than for CPython2 (CPython3). The observed performance improvements in dynamically-typed languages implemented in Java are due to recent enhancements in modern JVMs, which yield a large speed-up in the evaluation of mathematical functions.

Sergei Chekanov


Friday, November 6, 2020

DSC Friday News, 6 Nov 2020

The DSC News is published by Data Science Central, and highlights new content from our Weekly Digest. Previous editions can be found here. The contribution flagged with a + is our selection for the picture of the week. To subscribe, follow this link.

DSC Featured Articles

Articles From Tech Target

Picture of the Week

Source: article flagged with a + 


Mistakes that most of the analysts do in their analytics

In data science and the analytical sciences, almost any approach can potentially solve a given problem. In trying to derive such solutions, however, analysts commonly make the following mistakes in practice.

  1. Failing to apply common sense and subject matter expertise when using analytical tools and techniques to solve a given problem.
  2. Leaning heavily on models such as AI and ML (tools and techniques) while failing to grasp the significance of the problem itself.
  3. Ignoring basic statistical sampling techniques, so that the results lack enough evidence to support inference.
  4. Delivering inadequate or fragmented solutions because root causes were ignored; this stems from a lack of proper investigation and in-depth analysis of the problem.
  5. Steering the results of funded projects toward predetermined conclusions in order to convince investors. Such results can seriously mislead investors in the long run, because they were not derived from the data analysis.
  6. Working with data of questionable quality. Different functions within an organization are often unwilling to share their data with the analytics or data science teams: they treat confidentiality as their top priority and do not see that sharing data with internal employees leads to better results.
  7. Presenting results that have no basis in research. Baseless analytical results are always dangerous; baseless arguments incur enormous cost and lead to many adverse effects on organizations and systems.
  8. Promoting only acceptable and positive findings. Analysis should report results based on evidence from the data, yet many analysts present negative numbers as acceptable or ignore them altogether.
  9. Misleading end users by promoting predetermined results in anticipation of personal growth and future aspirations.
  10. Putting personal brand and image ahead of the client. An analytics expert has a responsibility, and must have the integrity, to provide the best achievable and acceptable results to clients.
  11. Finally, as a consequence of the mistakes above, clients receive improper, weakened, and biased results about their business; their business decisions then do not fit the market, which can lead to a crisis in business management.


Thursday, November 5, 2020

Java: What Makes it the Top Choice for Data Science

Data science uses algorithms, tools, and technologies to identify patterns in data and eventually glean valuable insights from the raw data. Simply put, it allows us to make sense of the abundance of data all around us. It is no surprise that data science has quickly gained popularity globally, with more and more companies and enterprises seeking to integrate it into their operations. They do so, of course, by developing the relevant tools and software. This brings us to a vital question: which programming language should you use?

Of course, there is no dearth of options here: Python, R, and many other programming languages are widely used for data science. But one name has carved out quite a niche for itself in this market: Java. Despite being one of the oldest programming languages used for enterprise development, it has managed not only to keep up but to stay ahead of the curve, and it continues to enable the development of advanced apps and software. Why else would at least 21 percent of people working in data science use Java for it? It is because Java offers benefits such as interoperability and high levels of utility. To help drive the point home, we discuss some of Java's other benefits for data science below.

  1. Quick pace: Admittedly, Python is one of the most widely used programming languages for data science. And yet, Java is 25x faster than Python! So, if you are going to build something that involves a lot of computing, Java is the right choice for you. Besides processing speed, Java is also much quicker in terms of development time. This is due to its ability to integrate with, and facilitate the use of, various tools that ease the development process significantly compared to some of Java’s rivals.
  2. Plenty of frameworks: When the critical goal is data science, it always helps to have tools and features specifically aimed at that requirement. Java offers precisely that: plenty of high-quality frameworks such as ND4J, Apache Mahout, and Deeplearning4J. Such frameworks make development work much easier and help programmers save precious time and money.
  3. Scalability: This understandably crucial consideration is made effortless with Java. How? With a variety of components specifically aimed at addressing those concerns. As processing requirements grow, Java enables the seamless addition of new hardware, semi-transparent redistribution of load, and so on.

Java may not be the most popular language for data science in the world. However, it is quickly getting there, thanks to the countless benefits it delivers for anyone working with data science. If you too want to take it up for your data science projects, you should get in touch with a trusted vendor for Java application development services.


Fueling Digital Transformation with Service Design

One of the industry’s great debates (bigger than regular Cap’n Crunch versus Cap’n Crunch with Crunchberries) is “What is Digital Transformation?”  Here’s my take:

Digital Transformation is the fundamental reinvention of an organization’s business model by synergizing advanced analytics (composable, reusable analytic modules) with empowered teams to create a customer journey-centric culture that is continuously-learning and adapting to new sources of customer, product and operational value.

To help organizations navigate their Digital Transformation journey, I introduced the Digital Transformation Value Creation Mapping in the blog “It’s Not Digital Transformation; It’s Digital “Business” Transformation! – Part I.”  The Digital Transformation Value Creation Mapping provides a framework to help organizations 1) identify, 2) codify and 3) scale new sources of customer, product and operational value within a 4) customer journey-centric, continuously-learning and adapting (via advanced analytics and empowered humans) business and operational environment (see Figure 1).

Figure 1: Digital Transformation Value Creation Mapping

The starting point for the Digital Transformation Value Creation Mapping – identifying sources of value creation – requires a thorough and intimate knowledge of what your customers are trying to accomplish with respect to your products and/or services. As part of that effort, organizations must make the upfront investment to identify, validate, value and prioritize the customers’ sources of value as well as the impediments to their value realization.  And that detailed, holistic understanding of your customers’ “journey” should start well before they engage with your products and/or services, and ideally extends beyond that engagement as well.

And that’s where the concept of Service Design comes into play.

Role of Service Design in Digital Transformation

Service Design, a methodology out of the Design Thinking toolkit, is instrumental in ensuring that companies thoroughly and intimately understand the entirety of the holistic, end-to-end journeys of their different customer segments.

The Interaction Design Foundation defines Service Design as a “process where designers create sustainable solutions and optimal experiences for both customers in unique contexts and any service providers involved. Designers break services into sections and adapt fine-tuned solutions to suit all users’ needs in context—based on actors, location and other factors.”[1]

The Nielsen Norman Group states that a Service Blueprint (similar to a Service Design) is a “diagram that visualizes the relationships between different service components — people, props (physical or digital evidence), and processes — that are directly tied to touchpoints in a specific customer journey.” See Figure 2.

Figure 2: Source: “Service Blueprints: Definition”[2]

Service Design provides that deep understanding of your customers’ value creation and associated impediments that serves as the foundation for any organization’s Digital Transformation.  And since Digital Transformation is about the “fundamental reinvention of an organization’s business model”, a key part of that business model reinvention, for many organizations, includes transitioning your customers to a more engaging, value-based “Capabilities-as-a-service” (or Xaas) business model.

Understanding the Xaas Business Model

Xaas refers to the delivery of specific capabilities as a service. For example, many of you use Gmail or Yahoo! for email.  You don’t have to download and maintain an application on your laptop to get access to those email services.  Instead, you “consume” email capabilities (composing emails, reading and responding to emails, sending group emails, scheduling meetings, managing contacts) “as a service”.  The Xaas consumption model has the end user advantage of hiding the underlying technology complexities; the users only care about having access to their desired capabilities freed from worrying about the underlying technologies.

For example, Uber is leveraging its detailed understanding of consumers, drivers and traffic patterns to provide a new “transit-as-a-service” offering. The Transportation Authority of Marin (California) will pay Uber a subscription fee to facilitate requesting, matching and tracking of its high-occupancy vehicle fleet in support of local residents’ travel needs.

As more organizations rush to Xaas to find new and more predictable revenue streams, the lack of knowledge about how customers are using your products and services is striking.  And that’s when the trouble begins.

To be successful with an Xaas business model requires (see Figure 3):

  1. Superior consumer product usage insights (product usage tendencies, inclinations, affinities, relationships, associations, behaviors, patterns and trends).  Xaas providers must be able to quantify and predict where, how and under what conditions the product will be used and the load on that product across various usage scenarios.
  2. Superior product operational insights (product performance and product operational tendencies, inclinations, affinities, relationships, associations, behaviors, patterns and trends) to provide an “operational experience excellence” by reducing unplanned operational downtime and optimizing product performance across various usage scenarios (which requires use cases such as predictive maintenance, repair optimization, inventory cost reductions, logistics optimization, elimination of obsolete and excessive inventory, consumables inventory optimization, energy efficiencies, asset utilization, technician retention, remaining useful life, predicted salvage value, etc.).
  3. Superior data and instrumentation strategy to know (and quantify) what data is most important for what use cases and where to place sensors and other instrumentation and data capturing devices to capture that data so as to balance the operational costs associated with False Negatives and False Positives.

Figure 3: Ingredients for a Successful Xaas Business Model
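The third ingredient, balancing the operational costs of False Negatives and False Positives, can be made concrete with a small sketch. Everything in it is assumed for illustration: the function names, the made-up scores from a hypothetical predictive-maintenance model, and the two cost figures. The idea is simply to try each candidate alert threshold and keep the one with the lowest total cost.

```python
def total_cost(threshold, scored_events, cost_fp, cost_fn):
    """Total cost of raising an alert whenever score >= threshold.

    scored_events is a list of (score, failed) pairs, where failed is True
    when the asset really did fail.
    """
    cost = 0.0
    for score, failed in scored_events:
        alert = score >= threshold
        if alert and not failed:
            cost += cost_fp   # false positive: unnecessary service call
        elif failed and not alert:
            cost += cost_fn   # false negative: missed failure
    return cost

def cheapest_threshold(scored_events, cost_fp, cost_fn):
    """Try each observed score as a threshold and keep the cheapest one."""
    candidates = sorted({score for score, _ in scored_events})
    return min(candidates,
               key=lambda t: total_cost(t, scored_events, cost_fp, cost_fn))

# Hypothetical model scores, invented for this example
events = [(0.1, False), (0.2, False), (0.4, False), (0.5, True),
          (0.6, False), (0.7, True), (0.9, True)]

# Missed failures are expensive: alert early
print(cheapest_threshold(events, cost_fp=1.0, cost_fn=10.0))
# Unnecessary service calls are expensive: alert late
print(cheapest_threshold(events, cost_fp=10.0, cost_fn=1.0))
```

The same scores lead to different operating points depending on the relative costs, which is why the data and instrumentation strategy needs those costs quantified per use case.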

Xaas business model profitability can only be optimized when you tightly integrate and synergize your design, data and analytics strategies.

Role of Service Design with an Xaas Business Model

As organizations transition to an “as-a-service” business model, it requires an entirely new “product management” and “engineering” mindset:

“When you engineer a capability as a product, then it’s the user’s responsibility to figure out how best to use that product. But when you design a capability as a service, then it’s the designers’ and engineers’ responsibility to ensure that the service is capable of being used effectively by the user. This understanding of how users use your capabilities impacts revenue (usage-based revenue model), pricing (to thoroughly understand the value of that capability so as not to over or under price the capability) and SLA support agreements (so as to properly price service agreements based again upon the value of the capability).”

To help organizations to create the data and analytics strategy necessary to support their Xaas business strategy, I created the Data Science Journey Map canvas that identifies the data science (data and analytic) requirements necessary to 1) identify the customer requirements (as identified from the Solution Design) and 2) codify the customer requirements into analytic models (see Figure 4).

Figure 4: Data Science Journey Map Canvas

Figure 4 shows the Data Science Journey Map for a Discrete Manufacturer who is trying to optimize inventory costs and inventory availability. The Data Science Journey Map decomposes each stage of the customer journey into key actions (or decisions), metrics against which progress and success will be measured, the different analytics that we will need to provide to support the journey, and what data might be useful in supporting the analytics.
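One way to picture the structure of such a canvas is as a list of journey stages, each carrying its key actions, metrics, analytics, and data. The sketch below is only an illustration of that shape: the class name, field names, helper function, and every entry in the example are placeholders invented here, not content taken from Figure 4.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class JourneyStage:
    """One stage of the customer journey and its data science requirements."""
    name: str
    key_actions: List[str] = field(default_factory=list)
    metrics: List[str] = field(default_factory=list)
    analytics: List[str] = field(default_factory=list)
    data_sources: List[str] = field(default_factory=list)

def stages_missing_data(stages):
    """Return names of stages that list analytics but no supporting data yet."""
    return [s.name for s in stages if s.analytics and not s.data_sources]

# Placeholder entries, invented for illustration (not taken from Figure 4)
journey = [
    JourneyStage(
        name="Plan inventory",
        key_actions=["Decide reorder quantities"],
        metrics=["Inventory carrying cost"],
        analytics=["Demand forecast"],
        data_sources=["Order history"],
    ),
    JourneyStage(
        name="Replenish stock",
        key_actions=["Choose supplier"],
        metrics=["Stock-out rate"],
        analytics=["Lead-time prediction"],
    ),
]
print(stages_missing_data(journey))
```

Holding the canvas as data makes gaps easy to spot, for example stages where an analytic has been proposed but no data source has been identified to support it.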

Summary: Marrying Service Design with Data Science

As organizations pursue Digital Transformation, a key component to that effort is the transition to a customer-centric, value-focused “Capabilities-as-a-service” (Xaas) business model.  Analytics is T-H-E most important competency as companies push into those Xaas business models.  

Superior insights into consumer product usage patterns coupled with superior insights into product performance patterns enable Xaas providers to determine the optimal operational, pricing and customer service level agreements to ensure a viable and profitable Xaas business model. 

And if you haven't first mastered customer usage and product performance data management and advanced analytics...SURPRISE (and that’s not the good kind of surprise).

[1] “What is Service Design?”

[2] “Service Blueprints: Definition”


Monday, November 2, 2020

A Process to minimize the gap between research and its applications

Human history has accumulated numerous pieces of evidence, doctrines, and theologies through logic and through extensive research in every field that exists in the universe.

There is a huge gap between research and its practical application. Studies are conducted with enormous responsibility and extreme effort, yet they have very little effect on real-world problems. Most research across the globe appears only in a few journals, books, or other digital or printed forms; it is soon forgotten by the real world, and in the name of a new research paradigm investigators end up re-inventing the wheel. A huge share of this knowledge goes unused, or was never intended to be used with due diligence, because it is incomprehensible or unavailable to later generations. This wastes the valuable resources of individuals, organizations, and educational institutions, and can place an enormous burden on their research and findings.

Previous research should be made available to researchers across the globe so that they can choose unique research topics and arrive at valuable findings. Research should not be driven by individual preferences or personal benefit; the problems facing humanity should be the high priority and the motivation for conducting studies.

The best, most resilient, and most welcome way to reduce this gap is as follows:

  • Global research institutions and corporate organizations should, as a matter of urgency, come together to co-operate and co-ordinate in conducting research.
  • They should share common objectives, preferences, and benefits, and should encourage researchers to work on real-world issues.
  • They should also fund researchers to speed up the understanding and resolution of the problems the human race faces in every aspect of life, which can encourage people to solve issues collectively.
  • In doing so, they can help establish global peace and make the world a universally safe place to live.

In today's world, the available technology, architecture, and computing power, combined with artificial intelligence (AI) algorithms, can readily summarize existing research findings. They can also help identify research gaps and future research needs. These algorithmic findings can guide upcoming researchers in setting their research objectives and motivate them to investigate problems and find better, more optimal solutions that can be scaled up and implemented across the globe.

Companies such as Google and IBM (with Watson) have already built NLP tools that summarize text to a certain extent. Such companies can co-operate and co-ordinate with research institutions and then provide unique methodologies, which could become a new kind of business model that benefits both society and the companies themselves.
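To give a feel for what automatic summarization involves, here is a deliberately simple extractive sketch in Python. It is not how the commercial tools mentioned above work; the word-frequency scoring, the small stop-word list, and the function name are assumptions chosen for illustration. Each sentence is scored by the average frequency of its words in the whole text, and the top-scoring sentences are returned in their original order.

```python
import re
from collections import Counter

# A tiny, assumed stop-word list; real tools use far larger ones
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "for", "on", "that", "with", "it", "this"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    top = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    return [s for s in sentences if s in top]

abstract = ("Ensemble methods combine several models. "
            "Ensemble methods often reduce error. "
            "The weather was pleasant.")
print(summarize(abstract, max_sentences=2))
```

Applied across many abstracts, the same frequency counts could also hint at which topics are heavily covered and which are rarely mentioned, a rough proxy for the research gaps discussed above.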

It is the responsibility of researchers worldwide to avoid spreading misleading research findings that confuse communities across the globe; they should also stop publishing biased results. Much research is conducted merely to show presence and has no significant impact on real-world issues. This habit should stop as soon as possible, or the capabilities and funds involved should be redirected toward finding real solutions. It is also a huge responsibility to accept research findings that reveal bitter truths, and to act on them.

Finally, exchanging ideas and research findings will always help to build good societies for the human race around the globe.

I would be happy to receive your feedback or thoughts on how to enhance and implement this idea. Please contact me at +91-7829033033 to discuss further.
