Showing posts with label Featured Blog Posts - Data Science Central. Show all posts

Tuesday, October 27, 2020

Digital Twin, Virtual Manufacturing, and the Coming Diamond Age

If you have ever had a book self-published through Amazon or similar fulfillment houses, chances are good that the physical book did not exist prior to the order being placed. Instead, that book existed as a PDF file, image files for cover art and author photograph, perhaps with some additional XML-based metadata indicating production instructions, trim, paper specifications, and so forth.

When the order was placed, it was sent to a printer that likely was the length of a bowling alley, where the PDF was converted into a negative and then laser printed onto the continuous paper stock. This was then cut to a precise size that varied minutely from page to page depending upon the binding type, before being collated and glued into the binding.

At the end of the process, a newly printed book dropped onto a rolling platform and from there to a box, where it was potentially wrapped and deposited automatically before the whole box was closed, labeled, and passed to a shipping gurney. From beginning to end, the whole process likely took ten to fifteen minutes, and more than likely no human hands touched the book at any point in the process. There were no plates to change out, no prepress film being created, no specialized inking mixes prepared between runs. Such a book was not "printed" so much as "instantiated", quite literally coming into existence only when needed.

It's also worth noting here that the same book probably was "printed" to a Kindle or similar ebook format, but in that particular case, it remained a digital file. No trees were destroyed in the manufacture of the ebook.

Such print on demand capability has existed since the early 2000s, to the extent that most people generally do not even think much about how the physical book that they are reading came into existence. Yet this model of publishing represents a profound departure from manufacturing as it has existed for centuries, and is in the process of transforming the very nature of capitalism.


Shortly after these printing presses came online, there were a number of innovations with thermal molded plastic that made it possible to create certain types of objects to exquisite tolerances without actually requiring a physical mold. Ablative printing techniques had been developed during the 1990s and involved the use of lasers to cut away at materials based upon precise computerized instruction, working in much the same way that a sculptor chips away at a block of granite to reveal the statue within.

Additive printing, on the other hand, made use of a combination of dot matrix printing and specialized lithographic gels that would be activated by two lasers acting in concert. The gels would harden at the point of intersection, then when done the whole would be flushed with reagents that removed the "ink" that hadn't been fixed into place. Such a printing system solved one of the biggest problems of ablative printing in that it could build up an internal structure in layers, making it possible to create interconnected components with minimal physical assembly.

The primary limitation that additive printing faced was the fact that it worked well with plastics and other gels, but the physics of metals made such systems considerably more difficult to solve - and a great deal of assembly requires the use of metals for durability and strength. By 2018, however, this problem was increasingly finding solutions for various types of metals, primarily by using annealing processes that heated up the metals to sufficient temperatures to enable pliability in cutting and shaping.

What this means in practice is that we are entering the age of just in time production in which manufacturing exists primarily in the process of designing what is becoming known as a digital twin. While one can argue that this refers to the use of CAD/CAM like design files, there's actually a much larger, more significant meaning here, one that gets right to the heart of an organization's digital transformation. You can think of digital twins as the triumph of design over manufacturing, and data and metadata play an oversized role in this victory.


At the core of such digital twins is the notion of a model. A model, in the most basic definition of the word, is a proxy for a thing or process. A runway model, for instance, is a person who is intended to be a proxy for the viewer, showing off how a given garment looks. An artist's model is a stand-in or proxy for the image, scene, or illustration that an artist is producing. An architectural model is a simulation of how a given building will look when constructed, and with 3D rendering technology, such models can appear quite life-like. Models can also simulate more than appearance, though - they can simulate structural integrity, strain analysis, and even chemical interactions. We create models of stars, black holes, and neutron stars based upon our understanding of physics, and models of disease spread in the case of epidemics.

Indeed, it can be argued that the primary role of a data scientist is to create and evaluate models. This is one of the reasons that data scientists are in such increasing demand: the ability to build models is one of the most pressing needs that any organization can have, especially as more and more of a company's production exists in the form of digital twins.

There are several purposes for building such models: the most obvious is to reduce (or in some cases eliminate altogether) the cost of instantiation. If you create a model of a car, you can stress test the model, can get feedback from potential customers about what works and what doesn't in its design, can determine whether there's sufficient legroom or if the steering wheel is awkwardly placed, can test to see whether the trunk can actually hold various sized suitcases or packages, all without the cost of actually building it. You can test out gas consumption (or electricity consumption), can see what happens when it crashes, can even attempt to explode it. While such models aren't perfect (nor are they uniform), they can often serve to significantly reduce the things that may go wrong with the car before it ever goes into production.

However, such models, such digital twins, also serve other purposes. All too often, decisions are made not on the basis of what the purchasers of the thing being represented want, but what a designer, or a marketing executive, or the CEO of a company feels the customer should get. When there was a significant production cost involved in instantiating the design, this often meant that there was a strong bias towards what the decision-maker greenlighting the production felt should work, rather than towards what the stakeholders who would be not only purchasing but also using the product actually wanted. With 3D production increasingly becoming a reality, however, control is shifting from the producer to the consumer, and not just at the higher end of the market.

Consider automobile production. Currently, millions of cars are produced by automakers globally, but a significant number never get sold. They end up clogging lots, moving from dealerships to secondary markets to fleet sales, and eventually end up in the scrapyard. They don't get sold primarily because they simply don't represent the optimal combination of features at a given price point for the buyer.

The industry has, however, been changing its approach, pushing the consumer much closer to the design process before the car is even built. Colors, trim, engine type, seating, communications and entertainment systems, types of brakes: all of these and more can be changed. Increasingly, these changes are even making their way to the configuration of the chassis and carriage. This becomes possible because it is far easier to change the design of the digital twin than it is to change the physical entity, and that physical entity can then be "instantiated" within a few days of ordering it.

What are the benefits? You end up producing product upon demand, rather than in anticipation of it. This means that you need to invest in fewer materials, have smaller supply chains, produce less waste, and in general have a more committed customer. The downside, of course, is that you need fewer workers, have a much smaller sales infrastructure, and have to work harder to differentiate your product from your competitors. This is also happening now - it is becoming easier for a company such as Amazon to sell bespoke vehicles than ever before, because of that digitalization process.

This is in fact one of the primary dangers facing established players. Even today, many C-Suite managers see themselves in the automotive manufacturing space, or the aircraft production space, or the book publishing space. Yet ultimately, once you move to a stage where you have digital twins creating a proxy for the physical object, the actual instantiation - the manufacturing aspect - becomes very much a secondary concern.

Indeed, the central tenet of digital transformation is that everything simply becomes a publishing exercise. If I have the software product to build a car, then ultimately the cost of building that car involves purchasing the raw materials and the time on a 3D printer, then performing the final assembly. There is a growing "hobbyist" segment of companies that can go from bespoke design to finished product in a few weeks. Ordinarily the volume of such production is low enough that it is tempting to ignore what's going on, but between Covid-19 reshaping retail patterns, the diminishing spending power of Millennials and Gen Zers, and the changes increasingly being required by climate change, the bespoke digital twin is likely to eat into increasingly thin margins.

Put another way, existing established companies in many different sectors have managed to maintain their dominance both because they were large enough to dictate the language that described the models and because they could take advantage of the costs involved in manufacturing and production creating a major barrier to entry of new players. That's now changing.


Consider the first part of this assertion. Names are important. One of the realizations that has emerged in the last twenty years is that before two people or organizations can communicate with one another, they need to establish (and refine) the meanings of the language used to identify entities, processes, and relationships. An API, when you get right down to it, is a language used to interact with a system. The problem with trying to deal with intercommunication is that it is generally far easier to establish internal languages - the way that one organization defines its terms - than it is to create a common language. For a dominant organization in a given sector, this often also manifests as the desire to dominate the linguistic debate, as this puts the onus of changing the language (a time-consuming and laborious process) onto competitors.

However, this approach has also backfired spectacularly more often than not, especially when those competitors are willing to work with one another to weaken a dominant player. Most successful industry standards are pidgins - languages that capture 80-90% of the commonality in a given domain while providing a way to communicate about the remaining 10-20% that typifies the specialty of a given organization. This is the language of the digital twin, the way that you describe it, and the more that organizations subscribe to that language, the easier it is for those organizations to interchange digital twin components.

To put this into perspective, consider the growth of bespoke automobiles. One form of linguistic harmonization is the standardization of containment - the dimensions of a particular component, the location of ports for physical processes (pipes for fluids, air and wires) and electronic ones (the use of USB or similar communication ports), agreements on tolerances and so forth. With such ontologies in place, construction of a car's digital twin becomes far easier. Moreover, by adhering to these standards, linguistic as well as dimensional, you still get specialization at a functional level (for instance, the performance of a battery) while at the same time being able to facilitate containment variations, especially with digital printing technology.
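
The idea of standardized containment can be made concrete with a small sketch. What follows is an illustrative assumption rather than any real automotive standard: the `Port` and `Envelope` types, the field names, and the 0.5 mm tolerance are all hypothetical. It checks whether a component's digital twin fits a bay and whether its ports line up within the agreed tolerance.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Port:
    kind: str    # hypothetical port types, e.g. "power" or "coolant"
    x_mm: float  # position on the mounting face
    y_mm: float


@dataclass(frozen=True)
class Envelope:
    """Shared 'containment' description of a component or of the bay that holds it."""
    length_mm: float
    width_mm: float
    height_mm: float
    ports: Tuple[Port, ...]


def fits(component: Envelope, bay: Envelope, tolerance_mm: float = 0.5) -> bool:
    """True if the component is no larger than the bay and every port has a match."""
    dims_ok = (
        component.length_mm <= bay.length_mm
        and component.width_mm <= bay.width_mm
        and component.height_mm <= bay.height_mm
    )

    def port_matches(p: Port) -> bool:
        # A port matches if the bay offers the same kind of port
        # at the same position, within the agreed tolerance.
        return any(
            q.kind == p.kind
            and abs(q.x_mm - p.x_mm) <= tolerance_mm
            and abs(q.y_mm - p.y_mm) <= tolerance_mm
            for q in bay.ports
        )

    return dims_ok and all(port_matches(p) for p in component.ports)


# Made-up example data: one bay and two candidate batteries.
bay = Envelope(700, 400, 250, (Port("power", 50, 20), Port("coolant", 600, 20)))
battery_a = Envelope(690, 395, 240, (Port("power", 50.2, 20), Port("coolant", 600, 19.8)))
battery_b = Envelope(690, 395, 240, (Port("power", 80, 20),))

print(fits(battery_a, bay))
print(fits(battery_b, bay))
```

In this sketch the two batteries can differ freely in chemistry or performance; only the shared envelope description decides whether either one can be dropped into the design.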

As an ontology emerges for automobile manufacturing, this facilitates "plug-and-play" at a macro-level. The barrier to entry for creating a vehicle drops dramatically, though likely not quite to the individual level (except for well-heeled enthusiasts). Ironically, this makes it possible for a designer to create a particular design that meets their criteria, and also makes it possible for that designer to sell or give that IP to others for license or reuse. Now, if history is any indication, that will likely initially lead to a lot of very badly designed cars, but over time, the bad designers will get winnowed out by long-tail market pressures.

Moreover, because it becomes possible to test digital twins in virtual environments, the market for digital wind-tunnels, simulators, stress analyzers and so forth will also rise. That is to say, just as programming has developed an agile methodology for testing, so too would manufacturing facilitate data agility that serves to validate designs. Lest this be seen as a pipe dream, consider that most contemporary game platforms can, with very little tweaking, be reconfigured for exactly this kind of simulation work, especially as GPUs increase in performance and available memory.

The same type of interoperability applies not just to the construction of components, but also to all aspects of resource metadata, especially with datasets. Ontologies provide ways to identify, locate and discover the schemas of datasets for everything from usage statistics to simulation parameters for training models. The design of that car (or airplane, or boat, or refrigerator) is simply one more digital file, transmissible in the same way that a movie or audio file is, and containing metadata that puts those resources into the broader context of the organization.

The long term impact on business is simple. Everything becomes a publishing company. Some companies will publish aircraft or automobiles. Others will publish enzymes or microbes, and still others will publish movies and video games. You still need subject matter expertise in the area that you are publishing into - a manufacturer of pastries will be ill-equipped to handle the publishing of engines, for instance - but overall you will see a convergence in the process, regardless of the end-product.

How long will this process take to play out? In some cases, it's playing out now. Book publishing is almost completely virtual at this stage, and the distinction between the physical object and the digital twin comes down to whether instantiation takes place or not. The automotive industry is moving in this direction, and drone tech (especially for military drones) has been shifting this way for years.

On the other hand, entrenched companies with extensive supply chains will likely adopt such digital twin approaches relatively slowly, and more than likely only at a point where competitors make serious inroads into their core businesses (or the industries themselves are going through a significant economic shock). Automobiles are going through this now, as the combination of the pandemic, the shift towards electric vehicles, and changing demographics is creating a massive glut in automobile production that will likely result in the collapse of internal combustion engine vehicle sales altogether over the next decade, along with a rethinking of the ownership relationship with respect to vehicles.

Similarly, the aerospace industry faces an existential crisis as demand for new aircraft has dropped significantly in the wake of the pandemic. While aircraft production is still a very high-cost business, the ability to create digital twins - along with an emergence of programming ontologies that make interchange between companies much more feasible - has opened up the market to smaller, more agile competitors who can create bespoke aircraft much more quickly by distributing the overall workload and specializing in configurable subcomponents, many of which are produced via 3D printing techniques.


Construction, likewise, is dealing with both the fallout due to the pandemic and the increasing abstractions that come from digital twins. The days when architects worked out details on paper blueprints are long gone, and digital twins of construction products are increasingly being designed with earthquake and weather testing, stress analysis, airflow and energy consumption and so forth. Combine this with the increasing capabilities inherent in 3D printing both full structures and custom components in concrete, carbon fiber and even (increasingly) metallic structures. There are still limitations; as with other large structure projects, the lack of specialized talent in this space is still an issue, and fabrication units are typically not yet built on a scale that makes them that useful for onsite construction.

Nonetheless, the benefits make achieving that scaling worthwhile. A 3D printed house can be designed, approved, tested, and "built" within three to four weeks, as opposed to six months to two years for traditional processes. Designs, similarly, can be bought or traded and modified, making it possible to create neighborhoods where there are significant variations between houses as opposed to the two or three prefab designs that tend to predominate, in the US especially. Such constructs also can move significantly away from the traditional boxy structures that most houses have, both internally and externally, as materials can be shaped to best fit the design aesthetic rather than the inherent rectangular slabs that typify most building construction.

Such constructs can also be set up to be self-aware, to the extent that sensors can be built into the infrastructure and viewscreens (themselves increasingly moving away from flatland shapes) can replace or augment the views of the outside world. In this sense, the digital twin of the instantiated house or building is able to interact with its physical counterpart, maintaining history (memory) while increasingly able to adapt to new requirements.


This feedback loop - the ability of the physical twin to affect the model - provides a look at where this technology is going. Print publishing, once upon a time, had been something where the preparation of the medium, the book or magazine or newspaper, occurred only in one direction - from digital to print. Today, the print resides primarily on phones or screens or tablets, and authors often provide live blog chapters that evolve in agile ways. You're seeing the emergence of processors such as FPGAs that configure themselves programmatically, literally changing the nature of the processor itself in response to software code.

It's not that hard, with the right forethought, to envision real world objects that can reconfigure themselves in the same way - buildings reconfiguring themselves for different uses or to adapt to environmental conditions, cars that can reconfigure their styling or even body shape, clothing that can change color or thermal profiles, aircraft that can be reconfigured for different uses within minutes, and so forth. This is already reality in some places, though still piecemeal and in one-off cases, but the malleability of digital twins - whether of office suites or jet engines - is the future of manufacturing.

The end state, likely still a few decades away, will be an economy built upon just-in-time replication and the importance of the virtual twin, where you are charged not for the finished product but the cost of the license to use a model, the material components, the "inks", for same, and the processing to go from the former to the latter (and back), quite possibly with some form of remuneration for recycled source. Moreover, as this process continues, more and more of the digital twin carries the burden of existence (tools that "learn" a new configuration are able to adapt to that configuration at any time). The physical and the virtual become one.
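
The pricing model described above (license, "inks", processing, and a credit for recycled source) is simple enough to write down. The function below is only a sketch of that breakdown. The parameter names, the per-kilogram pricing, and the example figures are assumptions made for illustration, not data from any real producer.

```python
def instantiation_price(
    license_fee,
    materials_kg,
    price_per_kg,
    machine_hours,
    hourly_rate,
    recycled_kg=None,
    credit_per_kg=None,
):
    """Price of instantiating one object from its digital twin.

    price = license + materials ("inks") + processing - recycling credit
    """
    recycled_kg = recycled_kg or {}
    credit_per_kg = credit_per_kg or {}

    # Cost of the raw materials consumed by the print.
    materials = sum(kg * price_per_kg[name] for name, kg in materials_kg.items())
    # Cost of the machine time needed to go from model to object.
    processing = machine_hours * hourly_rate
    # Remuneration for returned source material; unknown materials earn nothing.
    credit = sum(kg * credit_per_kg.get(name, 0.0) for name, kg in recycled_kg.items())

    return license_fee + materials + processing - credit


# Made-up figures, purely for illustration.
price = instantiation_price(
    license_fee=40.0,
    materials_kg={"polymer": 2.0, "aluminium": 0.5},
    price_per_kg={"polymer": 6.0, "aluminium": 10.0},
    machine_hours=3,
    hourly_rate=5.0,
    recycled_kg={"polymer": 1.0},
    credit_per_kg={"polymer": 2.0},
)
print(price)  # 40 + (12 + 5) + 15 - 2 = 70.0
```

Note that nothing in the formula charges for a finished product as such. Every term is either a right to use the model or a cost of turning it into matter.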


Some may see the resulting society as utopian, others as dystopian, but what is increasingly unavoidable is the fact that this is the logical conclusion of the trends currently at work (for some inkling of what such a society may be like, I'd recommend reading The Diamond Age by Neal Stephenson, which I believe to be very prescient in this regard).

from Featured Blog Posts - Data Science Central
via Gabe's Musings


What Is Fintech? 

"Fintech" describes the new technology integrated into various spheres to improve and automate all aspects of financial services provided to individuals and companies. Initially, this word was used for the tech behind the back-end systems of big banks and other organizations. Now it covers a wide spectrum of finance-related innovations in multiple industries, from education to cryptocurrencies and investment management.

While traditional financial institutions offer a bundle of services, fintech focuses on streamlining individual offerings, making each an affordable, often one-click experience for users. This impact can be described with the word "disruption" - and now, to be competitive, banks and other conventional establishments have no choice but to change entrenched practices through cooperation with fintech startups. A vivid example is Visa's partnership with Ingo Money to accelerate the process of digital payments. Despite the slowdown related to the Covid-19 epidemic, the fintech industry will recover momentum and continue to change the face of the finance world.

Fintech users

Fintech users fall into four main groups across two categories. Trends such as mobile banking, big data, and the unbundling of financial services will create opportunities for all of them to interact in novel ways:

  1. B2B - banks and their business clients
  2. B2C - small enterprises and individual consumers

The main target group for consumer-oriented fintech is millennials - young, ready to embrace digital transformation, and accumulating wealth.

What needs do they have? According to the Credit Karma survey, 85% of millennials in the USA suffer from burnout syndrome and have no energy to think about managing their personal finances. Therefore, any app that automates and streamlines these processes has a good chance of becoming popular. They need an affordable personal financial assistant that can do the following 24/7:

  • Analyze spending behaviors, recurrent payments, bills, debts
  • Present an overview of their current financial situation
  • Provide coaching and improve financial literacy

What they expect to achieve:

  • Stop overspending (avoid late bills, do smart shopping with price comparison, cancel unnecessary subscriptions, etc.)
  • Develop saving habits, get better organized
  • Invest money (analyze deposit conditions in different banks, form an investment portfolio, etc.)
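
As a rough illustration of the "analyze recurrent payments" and "cancel unnecessary subscriptions" items above, here is a minimal sketch of how an assistant might flag subscription-like charges. The heuristic (same merchant, same amount, roughly even spacing), the thresholds, and the merchant names are all assumptions made for the example. A production system would need something far more robust.

```python
from collections import defaultdict
from datetime import date


def find_recurring(transactions, min_occurrences=3, max_gap_jitter_days=3):
    """Return {merchant: amount} for charges that repeat at a steady interval.

    transactions: iterable of (merchant, amount, date) tuples.
    """
    by_key = defaultdict(list)
    for merchant, amount, day in transactions:
        by_key[(merchant, amount)].append(day)

    recurring = {}
    for (merchant, amount), days in by_key.items():
        if len(days) < min_occurrences:
            continue
        days.sort()
        # Number of days between consecutive charges.
        gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
        # Steady interval: the gaps differ by at most a few days.
        if max(gaps) - min(gaps) <= max_gap_jitter_days:
            recurring[merchant] = amount
    return recurring


# Hypothetical transaction history.
transactions = [
    ("StreamFlix", 12.99, date(2020, 1, 5)),
    ("Grocer", 54.10, date(2020, 1, 7)),
    ("Gym", 30.00, date(2020, 1, 1)),
    ("Grocer", 23.75, date(2020, 1, 19)),
    ("StreamFlix", 12.99, date(2020, 2, 5)),
    ("Gym", 30.00, date(2020, 2, 1)),
    ("StreamFlix", 12.99, date(2020, 3, 5)),
    ("Gym", 30.00, date(2020, 3, 1)),
]

print(find_recurring(transactions))
```

With this data, the two monthly charges are flagged and the irregular grocery purchases are not, which is the kind of overview a user would need before deciding which subscriptions to cancel.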

The fintech industry offers many solutions that can meet all these goals - not only on an individual but also on a national scale. However, in many countries, there is still a high percentage of unbanked people, meaning those with no bank account of any kind. According to the World Bank report, this number was 1.7 billion people in 2017. Mistrust of new technologies, poverty, and financial illiteracy are the obstacles that keep this group from tapping into the huge potential of fintech. Therefore, businesses and governments must direct inclusion efforts towards this audience, as all stakeholders will benefit from it. Affordable and easy-to-get fintech services customized for this huge group of first-time users are likely to be a big trend in the future.

Big Data, AI, ML in Fintech

According to an Accenture report, AI integration will boost corporate profits in many industries, including fintech, by almost 40% by 2035, which equals a staggering $14 trillion. Big Data technologies, such as streaming analytics, in-memory computing, artificial intelligence, and machine learning, will be the powerhouse behind numerous business objectives that banks, credit unions, and other institutions strive to achieve:

  • Aggregate and interpret massive amounts of structured and unstructured data in real-time.
  • With the help of predictive analytics, make accurate future forecasts, identify potential problems (e.g., credit scoring, investment risks)
  • Build optimal strategies based on analytical reports
  • Facilitate decision-making
  • Segment clients for more personalized offers and thus increase retention.
  • Detect suspicious behavior, prevent identity fraud and other types of cybercrime, make transactions more secure with such technologies as face and voice recognition.
  • Find and extend new borrower pools among the no-file/thin-file segment, widely represented by Gen Z (the successors of millennials), who lack or have a short credit history.
  • Automate low-value tasks (e.g., such back-office operations as internal technical requests)
  • Cut operational expenses by streamlining processes (e.g., image recognition algorithms for scanning, parsing documents, and taking further actions based on regulations) and reducing man-hours.
  • Considerably improve client experience with conversational user interfaces, available 24/7, and capable of resolving any issues instantly. Conversational banking is used by many big banks worldwide; some companies integrate financial chatbots for processing payments in social media.


Neobanks

Digital or internet-only banks do not have brick-and-mortar branches and operate exclusively online. The word neobank became widely used in 2017 and referred to two types of app-based institutions - those that provided financial services with their own banking license and those partnering with traditional banks. The inconvenience of wasting time in lines and on paperwork is the reason why bank visits are predicted to fall to just four a year by 2022. Neobanks, e.g., Revolut, Digibank, FirstDirect, offer a wide range of services - global payments and P2P transfers, virtual cards for contactless transactions, operations with cryptocurrencies, etc. - and the fees are lower than with traditional banks. Clients get support through in-app chat. Among the challenges associated with digital banking are higher susceptibility to fraud and lower trustworthiness due to the lack of a physical address. In the US, the development of neobanks faced regulatory obstacles. However, the situation is changing for the better.

Smart contracts

A smart contract is software that allows automatic execution and control of agreements between buyers and sellers. How does it work? If two parties want to agree on a transaction, they no longer need a paper document and a lawyer. They sign the agreement digitally with cryptographic keys. The document itself is encoded in a tamper-proof manner. The role of witnesses is performed by a decentralized blockchain network of computing devices that receive copies of the contract, and the code guarantees the fulfillment of its provisions, with all transactions transparent, trackable, and irreversible. This high level of reliability and security makes fintech operations possible anywhere in the world, at any time. The parties to the contract can be anonymous, and there is no need for other authorities to regulate or enforce its implementation.
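
To make the mechanics a little more tangible, here is a toy escrow agreement in plain Python. It is a sketch under heavy assumptions: it runs on one machine rather than on a decentralized network, the `EscrowContract` class and its methods are hypothetical, and an HMAC over a shared secret stands in for a real public-key signature. It only illustrates the three ideas in the paragraph above: signed terms, tamper detection, and automatic, one-time execution.

```python
import hashlib
import hmac
import json


class EscrowContract:
    """Toy agreement: pays the seller once both parties have signed and delivery is confirmed."""

    def __init__(self, buyer, seller, amount, registered_keys):
        self.terms = {"buyer": buyer, "seller": seller, "amount": amount}
        # Stand-in for the keys a real network would use to verify signatures.
        self.registered_keys = registered_keys
        # Fingerprint of the terms; any later edit to the terms changes it.
        self.digest = self._hash_terms()
        self.signatures = {}
        self.settled = False

    def _hash_terms(self):
        encoded = json.dumps(self.terms, sort_keys=True).encode()
        return hashlib.sha256(encoded).hexdigest()

    def _sign(self, key):
        return hmac.new(key, self.digest.encode(), hashlib.sha256).hexdigest()

    def sign(self, party, secret_key):
        self.signatures[party] = self._sign(secret_key)

    def _fully_signed(self):
        for party in (self.terms["buyer"], self.terms["seller"]):
            expected = self._sign(self.registered_keys[party])
            if not hmac.compare_digest(self.signatures.get(party, ""), expected):
                return False
        return True

    def confirm_delivery(self, balances):
        """Execute the transfer automatically if every condition holds."""
        if self.settled:
            raise RuntimeError("contract already executed")
        if self._hash_terms() != self.digest:
            raise ValueError("terms were altered after signing")
        if not self._fully_signed():
            raise PermissionError("both parties must sign first")
        balances[self.terms["buyer"]] -= self.terms["amount"]
        balances[self.terms["seller"]] += self.terms["amount"]
        self.settled = True  # a settled contract cannot be run again


# Hypothetical parties, keys, and balances.
keys = {"alice": b"alice-secret", "bob": b"bob-secret"}
balances = {"alice": 100, "bob": 0}

contract = EscrowContract("alice", "bob", 40, keys)
contract.sign("alice", keys["alice"])
contract.sign("bob", keys["bob"])
contract.confirm_delivery(balances)
print(balances)
```

On a real blockchain, the checks in `confirm_delivery` would be run by many independent nodes rather than by one trusted program, which is what removes the need for a lawyer or other enforcing authority.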

Open banking

Open banking is a system that allows third parties to access the data of bank and non-bank financial institutions through APIs (application programming interfaces) to create a network. Third-party service providers, such as tech startups, aggregate these data through apps, upon user consent, and apply them to identify, for instance, the best financial products, such as the savings account with the highest interest rate. Networked accounts will allow banks to accurately calculate mortgage risks and offer the best terms to low-risk clients. Open banking will also help small companies save time with online accounting and will play an important role in fraud detection. Services like Mint require users to provide credentials for each account, although this practice has security risks, and data processing is not always accurate. APIs are a better option as they allow direct data sharing without access to the login and password. Consumer security is still compromised, and this is one of the main reasons why the open banking trend hasn't taken off yet. Many banks worldwide cannot provide open APIs of sufficient quality to meet existing regulatory standards. There are still a lot of blind spots, including those related to technology. However, open banking is a promising trend. The Accenture report offers many interesting insights.
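
The aggregation step can be sketched without reference to any particular bank's API. In the example below, the `feeds` and `products` structures, the field names, and the institution names are invented for illustration. A real integration would obtain this data through each institution's own consented API rather than from in-memory dictionaries.

```python
def aggregate_balances(feeds, consents):
    """Total balance across accounts, using only institutions the user consented to share."""
    return sum(
        account["balance"]
        for institution, accounts in feeds.items()
        if institution in consents
        for account in accounts
    )


def best_savings_product(products):
    """Savings product with the highest interest rate, or None if there is none."""
    savings = [p for p in products if p["type"] == "savings"]
    return max(savings, key=lambda p: p["rate_pct"], default=None)


# Hypothetical account data, as it might arrive from three institutions.
feeds = {
    "bank_a": [{"type": "checking", "balance": 1200}],
    "bank_b": [{"type": "savings", "balance": 5000}],
    "bank_c": [{"type": "checking", "balance": 300}],
}
consents = {"bank_a", "bank_b"}  # the user has not shared bank_c

# Hypothetical product catalogue.
products = [
    {"institution": "bank_a", "type": "savings", "rate_pct": 0.4},
    {"institution": "bank_b", "type": "savings", "rate_pct": 1.1},
    {"institution": "bank_c", "type": "loan", "rate_pct": 6.5},
]

print(aggregate_balances(feeds, consents))
print(best_savings_product(products)["institution"])
```

The consent set is the important design point: the aggregator never sees credentials, and it only reads from institutions the user has explicitly approved.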

Blockchain and cryptocurrencies

Blockchain, the distributed ledger technology that underlies many cryptocurrencies, will continue to transform the face of global finance, with the US and China being global adoption leaders. The most valuable feature of a blockchain database is that data cannot be altered or deleted once it has been written. This high level of security makes it well suited to big data apps across various sectors, including healthcare, insurance, energy, banking, etc., especially those dealing with confidential information. Although the technology is still in the early stages of its development and will eventually become more suited to the needs of fintech, there are already innovative Blockchain-based solutions both from giants, like Microsoft and IBM, and from numerous startups. The philosophy of decentralized finance has already given rise to a variety of peer-to-peer financing platforms and will be the source of new cryptocurrencies, perhaps even national ones. Blockchain considerably accelerates transactions between banks through secure servers, and banks use it to build smart contracts. The technology is also growing in popularity with consumers. Since 2009, when Bitcoin was created, the number of Blockchain wallet users has reached 52 million. A wallet adds a layer of security known as "tokenization": payment information is sent to vendors as tokens that associate the transaction with the right account.
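
The claim that written data cannot be altered rests on hash chaining: each block stores the hash of the one before it, so changing any earlier record invalidates everything after it. The sketch below shows only that mechanism, using a plain Python list as the ledger. It is a simplification that leaves out consensus, networking, and mining, and the function names and record contents are made up for the example.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, data, prev_hash):
    payload = json.dumps({"index": index, "data": data, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def append_block(chain, data):
    """Add a record, linking it to the hash of the previous block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append(
        {"index": index, "data": data, "prev": prev_hash, "hash": block_hash(index, data, prev_hash)}
    )


def is_valid(chain):
    """Recompute every hash and check every link; any edit breaks the chain."""
    prev_hash = GENESIS_HASH
    for i, block in enumerate(chain):
        if block["index"] != i or block["prev"] != prev_hash:
            return False
        if block["hash"] != block_hash(block["index"], block["data"], block["prev"]):
            return False
        prev_hash = block["hash"]
    return True


ledger = []
append_block(ledger, {"from": "alice", "to": "bob", "amount": 40})
append_block(ledger, {"from": "bob", "to": "carol", "amount": 15})
append_block(ledger, {"from": "carol", "to": "alice", "amount": 5})
print(is_valid(ledger))  # True

ledger[1]["data"]["amount"] = 1500  # tamper with an earlier record
print(is_valid(ledger))  # False
```

Strictly speaking, the sketch makes tampering detectable rather than impossible. In a real network it is the many independent copies of the ledger that turn detection into practical immutability.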


Regtech

Regtech, or regulation technology, is represented by a group of companies, e.g., IdentityMind Global, Suade, Passfort, Fund Recs, providing AI-based SaaS solutions to help businesses comply with regulatory processes. These companies process complex financial data and combine them with information on previous regulatory failures to detect potential risks and design powerful analytical tools. Finance is a conservative industry, heavily regulated by the government. As the number of technology companies providing financial services increases, problems associated with compliance with such regulations also multiply. For instance, process automation makes fintech systems vulnerable to hacker attacks, which can cause serious damage. Responsibility for such security breaches and misuse of sensitive data, prevention of money laundering, and fraud are the main issues that concern state institutions, service providers, and consumers. There will be over 2.6 billion biometric users of payment systems by 2023, so the regtech application area is huge.

In the EU, PSD2 and SCA aim to regulate payments and their providers. Although these legal acts create some regulatory obstacles for fintech innovations, the European Commission also proposes a series of alleviating changes, for instance, eliminating paper documents for consumers. In the US, fintech companies must comply with outdated financial legislation. The silver lining is the new FedNow service for instantaneous payments, which is likely to be launched in 2023–2024 and provides a ready public infrastructure.


The insurance industry, like many others, needs streamlining to become more efficient and cost-effective and to meet the demands of the time. Insurtech companies are exploring new possibilities through a new generation of smart apps, such as ultra-customization of policies, behavior-based dynamic premium pricing based on data from Internet-enabled devices (such as GPS navigators and fitness activity trackers), AI brokerages, and on-demand insurance for micro-events. As we mentioned before, the insurance business is also subject to strict government regulations, and it requires close cooperation of traditional insurers and startups to make a breakthrough that will benefit everyone.

from Featured Blog Posts - Data Science Central
via Gabe's Musings

Cybersecurity Experts Discuss Company Misconception of The Cloud and More in Roundtable Discussion

Industry experts from TikTok, Microsoft, and more talk latest trends on cybersecurity & public policy.

Enterprise Ireland, Ireland’s trade and innovation agency, hosted a virtual Cyber Security & Public Policy panel discussion with several industry-leading experts. The roundtable discussion allowed cybersecurity executives from leading organizations to come together and discuss The Nexus of Cyber Security and Public Policy.

The panel included Roland Cloutier, the Global Chief Security Officer of TikTok; Ann Johnson, the CVP of Business Development - Security, Compliance & Identity at Microsoft; Richard Browne, the Director of Ireland’s National Cyber Security Centre; and Melissa Hathaway, the President of Hathaway Global Strategies LLC, who formerly spearheaded the Cyberspace Policy Review for President Barack Obama and led the Comprehensive National Cyber Security Initiative (CNCI) for President George W. Bush.

 Panelists discussed the European Cloud and the misconception companies have of complete safety and security when migrating to the Cloud and whether it is a good move for a company versus a big mistake. Each panelist also brought valuable perspective and experience to the table on other discussion topics including cyber security’s recent rapid growth and changes; the difference between U.S. and EU policies and regulations; who holds the responsibility for protecting consumer data and privacy; and more.

 “As more nations and states continue to improve upon cybersecurity regulations, the conversation between those developing policy and those implementing it within the industry becomes more important,” said Aoife O’Leary, Vice President of Digital Technologies, Enterprise Ireland. “We were thrilled to bring together this panel from both sides of the conversation and continue to highlight the importance of these discussions for both Enterprise Ireland portfolio companies and North American executives and thought leaders.”

 This panel discussion was the second of three events in Enterprise Ireland’s Cyber Demo Day 2020 series, inclusive of over 60 leading Irish cyber companies, public policy leaders, and cyber executives from many of the largest organizations in North America and Ireland.

 To view a recording of the Cyber Security & Public Policy Panel Discussion from September 23rd, please click here.


About Enterprise Ireland

Enterprise Ireland is the Irish State agency that works with Irish enterprises to help them start, grow, innovate, and win export sales in global markets. Enterprise Ireland partners with entrepreneurs, Irish businesses, and the research and investment communities to develop Ireland's international trade, innovation, leadership, and competitiveness. For more information on Enterprise Ireland, please visit


5 Steps to Collect High-quality Data

Obtaining good quality data can be a tough task. An organization may face quality issues when integrating data sets from various applications or departments or when entering data manually.

Here are some of the things a company can do to improve the quality of the information it collects:

1. Data Governance plan

A good data governance plan should not only cover ownership, classifications, sharing, and sensitivity levels but also follow through with procedural details that outline your data quality goals. It should also list all the personnel involved in the process and each of their roles and, more importantly, define a process for resolving issues.

2. Data Quality Guidance

You should also have a clear guide to use when separating good data from bad data. You will have to calibrate your automated data quality system with this information, so you need to have it laid out beforehand.
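One way to lay such guidance out so that an automated system can use it is a small set of named rules. The sketch below is only an illustration: the field names (`customer_id`, `email`, `age`), the rules, and the scoring function are all hypothetical and would be replaced by whatever your own guide defines.

```python
import re

# Hypothetical rules: each maps a readable name to a check on one record.
RULES = {
    "customer_id is present": lambda r: bool(r.get("customer_id")),
    "email looks valid": lambda r: re.fullmatch(
        r"[^@\s]+@[^@\s]+\.[^@\s]+", r.get("email") or ""
    ) is not None,
    "age is between 0 and 120": lambda r: isinstance(r.get("age"), int)
    and 0 <= r["age"] <= 120,
}


def failed_rules(record):
    """Return the names of the rules that this record breaks."""
    return [name for name, check in RULES.items() if not check(record)]


def quality_score(records):
    """Share of records that pass every rule (0.0 for an empty list)."""
    if not records:
        return 0.0
    good = sum(1 for r in records if not failed_rules(r))
    return good / len(records)


records = [
    {"customer_id": "c1", "email": "ann@example.com", "age": 34},
    {"customer_id": "", "email": "not-an-email", "age": 300},
]
for rec in records:
    print(rec.get("customer_id") or "<missing>", failed_rules(rec))
print(quality_score(records))  # 0.5
```

Keeping the rules in one named table means the same definitions can drive both the automated checks and the written guide.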

3. Data Cleansing Process

Data correction is the whole point of looking for flaws in your datasets. Organizations need to provide guidance on what to do with specific forms of bad data and identify what’s critical and common across all organizational data silos. Implementing data cleansing manually is cumbersome because, as the business shifts, new strategies dictate changes to the data and the underlying processes.

4. Clear Data Lineage

With data flowing in from different departments and digital systems, you need a clear understanding of data lineage – how an attribute is transformed as it moves from system to system – which in turn helps build trust and confidence in the data.

5. Data Catalog and Documentation

Improving data quality is a long-term process that you can streamline using both forecasts and past findings. By documenting every detected problem, along with the associated data quality score, in the data catalog, you reduce the risk of repeating mistakes and strengthen your data quality enhancement regime over time.

As stated above, there is just too much data out there to incorporate into your business intelligence strategy. The data volumes are building up even more with the introduction of new digital systems and the increasing spread of the internet. For any organization that wants to keep up with the times, that translates to a need for more personnel, from data curators and data stewards to data scientists and data engineers. Luckily, today’s technology and AI/ML innovation allow even the least tech-savvy individuals to contribute to data management with ease. Organizations should leverage these analytics-augmented data quality and data management platforms to realize immediate ROI rather than waiting through longer implementation cycles.


Insights from the free state of AI report

For the last few years, I have read the free state of AI report.

Here is the list of insights that I found interesting.

The full report and the download link are at the end of this article.


AI research is less open than you think: Only 15% of papers publish their code


Facebook’s PyTorch is fast outpacing Google’s TensorFlow in research papers, which tends to be a leading indicator of production use down the line


PyTorch is also more popular than TensorFlow in paper implementations on GitHub


Language models: Welcome to the Billion Parameter club

Huge models, large companies and massive training costs dominate the hottest area of AI today, NLP.


Bigger models, datasets and compute budgets clearly drive performance

Empirical scaling laws of neural language models show smooth power-law relationships, which means that as model performance increases, the model size and amount of computation have to increase more rapidly.
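To see why a power law implies rapidly growing model sizes, here is a small worked sketch. It assumes the loss follows L(N) = (N_c / N) ** alpha in the parameter count N; the function name `size_multiplier` and the exponent value 0.076 are illustrative assumptions, not figures taken from this article.

```python
def size_multiplier(loss_ratio, alpha=0.076):
    """Factor by which the parameter count N must grow so that the loss
    falls to `loss_ratio` times its current value, assuming
    L(N) = (N_c / N) ** alpha. The constant N_c cancels out."""
    if not 0 < loss_ratio <= 1:
        raise ValueError("loss_ratio must be in (0, 1]")
    return loss_ratio ** (-1.0 / alpha)


# Each extra slice of improvement costs far more parameters than the last.
for reduction in (0.05, 0.10, 0.20):
    factor = size_multiplier(1 - reduction)
    print(f"{reduction:.0%} lower loss needs about {factor:.1f}x the parameters")
```

Under these assumptions a 10% lower loss needs roughly four times the parameters, and the multiplier grows much faster than the improvement itself.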


Tuning billions of model parameters costs millions of dollars

Based on variables released by Google et al., you’re paying circa $1 per 1,000 parameters. This means OpenAI’s 175B parameter GPT-3 could have cost tens of millions to train. Experts suggest the likely budget was $10M.


We’re rapidly approaching outrageous computational, economic, and environmental costs to gain incrementally smaller improvements in model performance

Without major new research breakthroughs, dropping the ImageNet error rate from 11.5% to 1% would require over one hundred billion billion dollars! Many practitioners feel that progress in mature areas of ML is stagnant.


A larger model needs less data than a smaller peer to achieve the same performance

This has implications for problems where training data samples are expensive to generate, which likely confers an advantage to large companies entering new domains with supervised learning-based models.


Even as deep learning consumes more data, it continues to get more efficient

Since 2012 the amount of compute needed to train a neural network to the same performance on ImageNet classification has been decreasing by a factor of 2 every 16 months.
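The halving rate compounds quickly. As a small worked sketch (the function name `efficiency_gain` is hypothetical, and the steady 16-month halving period is simply taken from the sentence above):

```python
def efficiency_gain(months, halving_period=16):
    """How many times less compute is needed after `months`, assuming the
    requirement halves every `halving_period` months."""
    return 2 ** (months / halving_period)


for months in (16, 48, 96):
    print(f"after {months} months: {efficiency_gain(months):.0f}x less compute")
# after 16 months: 2x less compute
# after 48 months: 8x less compute
# after 96 months: 64x less compute
```

Eight years at this rate is six halvings, so the same ImageNet performance would need about one sixty-fourth of the original compute.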


A new generation of transformer language models are unlocking new NLP use-cases

GPT-3, T5, BART are driving a drastic improvement in the performance of transformer models for text-to-text tasks like translation, summarization, text generation, text to code.


NLP benchmarks take a beating: Over a dozen teams outrank the human GLUE baseline

It was only 12 months ago that the human GLUE benchmark was beaten by 1 point. Now SuperGLUE is in sight.


What’s next after SuperGLUE? More challenging NLP benchmarks zero-in on knowledge

A multi-task language understanding challenge tests for world knowledge and problem solving ability across 57 tasks including maths, US history, law and more. GPT-3’s performance is lopsided with large knowledge gaps.


The transformer’s ability to generalise is remarkable. It can be thought of as a new layer type that is more powerful than convolutions because it can process sets of inputs and fuse information more globally.

For example, GPT-2 was trained on text but can be fed images in the form of a sequence of pixels to learn how to autocomplete images in an unsupervised manner.


Biology is experiencing its “AI moment”: Over 21,000 papers in 2020 alone

Publications involving AI methods (e.g. deep learning, NLP, computer vision, RL) in biology are growing >50% year-on-year since 2017. Papers published since 2019 account for 25% of all output since 2000.


From physical object recognition to “cell painting”: Decoding biology through images

Large labelled datasets offer huge potential for generating new biological knowledge about health and disease.


Deep learning on cellular microscopy accelerates biological discovery with drug screens

Embeddings from experimental data illuminate biological relationships and predict COVID-19 drug successes.


Ophthalmology advances as the sandbox for deep learning applied to medical imaging

After diagnosis of ‘wet’ age-related macular degeneration (exAMD) in one eye, a computer vision system can predict whether a patient’s second eye will convert from healthy to exAMD within six months. The system uses 3D eye scans and predicted semantic segmentation maps.



AI-based screening mammography reduces false positives and false negatives in two large, clinically-representative datasets from the US and UK

The AI system, an ensemble of three deep learning models operating on individual lesions, individual breasts and the full case, was trained to produce a cancer risk score between 0 and 1 for the entire mammography case. The system outperformed human radiologists and could generalise to US data when trained on UK data only.


Causal reasoning is a vital missing ingredient for applying AI to medical diagnosis

Existing AI approaches to diagnosis are purely associative, identifying diseases that are strongly correlated with a patient’s symptoms. The inability to disentangle correlation from causation can result in suboptimal or dangerous diagnoses.


Model explainability is an important area of AI safety: A new approach aims to incorporate causal structure between input features into model explanations

A flaw with Shapley values, one current approach to explainability, is that they assume the model’s input features are uncorrelated. Asymmetric Shapley Values (ASV) are proposed to incorporate this causal information.
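For readers unfamiliar with Shapley values, the sketch below computes them exactly for a toy three-feature "model" by averaging each feature's marginal contribution over all orderings. The value function `toy_value`, the feature names, and the optional `allowed` filter are hypothetical. Restricting the average to orderings that respect a known causal order is only one way to picture the asymmetric idea and is not presented here as the published ASV method.

```python
from itertools import permutations


def shapley_values(players, value, allowed=None):
    """Average each player's marginal contribution over orderings.

    `value` maps a frozenset of players to a number. `allowed` is an
    optional predicate on orderings; with None, all orderings count
    equally, which gives the classical (symmetric) Shapley values.
    """
    orderings = [o for o in permutations(players) if allowed is None or allowed(o)]
    if not orderings:
        raise ValueError("no orderings satisfy the constraint")
    totals = {p: 0.0 for p in players}
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(features):
    """Hypothetical model output: 'a' and 'b' add on their own,
    while 'c' only matters together with 'a'."""
    score = 0.0
    if "a" in features:
        score += 10
    if "b" in features:
        score += 20
    if "a" in features and "c" in features:
        score += 6
    return score


print(shapley_values(["a", "b", "c"], toy_value))
# {'a': 13.0, 'b': 20.0, 'c': 3.0}

# Only count orderings in which 'a' comes before 'c'.
a_before_c = lambda order: order.index("a") < order.index("c")
print(shapley_values(["a", "b", "c"], toy_value, allowed=a_before_c))
# {'a': 10.0, 'b': 20.0, 'c': 6.0}
```

In the symmetric case the joint effect of 'a' and 'c' is split evenly between them; once the orderings are constrained, the whole interaction is credited to whichever feature is placed later.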



Reinforcement learning helps ensure that molecules you discover in silico can actually be synthesized in the lab. This helps chemists avoid dead ends during drug discovery.

RL agent designs molecules using step-wise transitions defined by chemical reaction templates.

American institutions and corporations continue to dominate NeurIPS 2019 papers

Google, Stanford, CMU, MIT and Microsoft Research own the Top-5.



The same is true at ICML 2020: American organisations cement their leadership position

The top 20 most prolific organisations by ICML 2020 paper acceptances further cemented their position vs. ICML 2019. The chart below shows their Publication Index position gains vs. ICML 2019.


Demand outstrips supply for AI talent

Analysis of US data shows almost 3x more job postings than job views for AI-related roles. Job postings grew 12x faster than job viewings from late 2016 to late 2018.

US states continue to legislate autonomous vehicles policies

Over half of all US states have enacted legislation related to autonomous vehicles.



Even so, driverless cars are still not so driverless: Only 3 of 66 companies with AV testing permits in California have been allowed to test without safety drivers since 2018

The rise of MLOps (DevOps for ML) signals an industry shift from technology R&D (how to build models) to operations (how to run models)

25% of the top-20 fastest growing GitHub projects in Q2 2020 concern ML infrastructure, tooling and operations. Google Search traffic for “MLOps” is now on an uptick for the first time.



As AI adoption grows, regulators give developers more to think about

External monitoring is transitioning from a focus on business metrics down to low-level model metrics. This creates challenges for AI application vendors including slower deployments, IP sharing, and more:


Berkshire Grey robotic installations are achieving millions of robotic picks per month

Supply chain operators realise a 70% reduction in direct labour as a result.



Algorithmic decision making: Regulatory pressure builds

Multiple countries and states start to wrestle with how to regulate the use of ML in decision making.



GPT-3, like GPT-2, still outputs biased predictions when prompted with topics of religion

Example from the GPT-3 (left) and GPT-2 (right) with prompts and the model’s predictions, which contain clear bias. Models trained on large volumes of language on the internet will reflect the bias in those datasets unless their developers make efforts to fix this. See our coverage in State of AI Report 2019 of how Google adapted their translation model to remove gender bias.

Free download link is at state of ai report


How Kids Channel Their Internal Data Scientist to Become Candy Optimization Machines on Halloween

Ghostly greetings!

I believe everyone is born with the innate, curiosity-driven, explore-test-learn Data Science capability. At Halloween, kids naturally embrace a rapid exploration, rapid testing, failure-empowering “Scientific Method” to optimize their candy yield and logistical “Trick or Treating” algorithms.

So, what can we – as parents and teachers – provide to help nurture these budding data scientists? How can we prepare them for a future using data and analysis (analytics) to make informed operational, policy and life decisions?

Don’t be a scaredy cat and let’s talk about how we can get our kids ready for the future – by preparing them to embrace their inner data scientist.

Teach Your Students How to Use the Hypothesis Development Canvas

The Hypothesis Development Canvas is a collaborative design tool that succinctly defines the problem that one is trying to solve. It captures the details of the hypothesis or problem, brainstorms the metrics and variables against which progress and success will be measured, identifies the stakeholders who either impact or are impacted by the targeted hypothesis, and identifies and prioritizes the decisions that the stakeholders need to make in support of the targeted hypothesis (see Figure 2).

Figure 2: Halloween “Treat or Treating” Candy Optimization  Hypothesis Development Canvas

Having your students construct a Hypothesis Development Canvas for their Trick or Treating objectives is a great way to help our future data scientists understand the importance of preparation before actually putting science to the data. The Hypothesis Development Canvas in Figure 2 provides a “paint by the numbers” example for our future data scientists to thoroughly understand what they are trying to achieve, how they will measure success and how they can leverage data and analysis to optimize their key decisions to optimize their Halloween “Treat or Treating” endeavor. This canvas helps clarify the following before actually diving into the analysis that drives the event optimization, including:

  • What are your Halloween candy gathering objectives? For example: “To gather and retain as much high-quality candy, within the allotted time period, as possible.”
  • What are the metrics against which you will measure candy gathering progress and success? For example: “Maximize candy quality, optimize candy volume, minimize effort exerted to gather candy, minimize distance covered to gather candy.”
  • Who are your key stakeholders who can help you achieve your objectives? For example: “Friends, parents, neighbors, siblings.”
  • What are the key decisions that you need to make? For example:
    • What outfit are you going to wear?
    • What neighborhoods and residences are you going to target?
    • When to start out and how long to go?
    • With which friends are you going? (Be sure to leave your skeleton friend at home because he’s got no-body to go with.)
    • What candies do you keep for yourself?
    • What candies are offered up for the “Dad Tax”?
    • What treats (raisins, apples) do you offload to your younger siblings?
  • What data might one want to use to help optimize the above decisions? For example:
    • Last Year’s Yield by Residence or Store
    • New Neighbors
    • Neighborhood Construction
    • Weather
    • Day of the Week (school night versus non-school night)
    • Friends’ Neighborhood Recommendations
    • Traffic
    • Local Events

Note: two of the most important outcomes from the Hypothesis Development Canvas exercise are 1) the identification of the variables and metrics against which hypothesis progress and success will be measured, and 2) the identification, validation, valuation and prioritization of the key decisions that they need to make in support of the targeted hypothesis. Get these two items right, and your students are well down the path to becoming data scientists and serving up Frankenstein his favorite kind of potatoes: monster-mashed!

Kids’ Halloween Candy Optimization in Action!

Children are naturally able to optimize across multiple, sometimes conflicting variables – volume of candy, quality of candy, distance to travel between sources of candy, time to wait at the door to get their candy – in order to optimize their candy gathering decisions. So, while we as parents see a traditional neighborhood map such as Figure 3…

Figure 3: Traditional Neighborhood Map

…our children are applying their innate data science (data and analysis) skills to map out the candy gathering targets and their logistical paths that they believe will yield the best results given the metrics against which they will measure progress and success (see Figure 4).

Figure 4: Optimized Candy Gathering Logistical Map

Kids’ Halloween Candy Optimization Homework Assignment

One last thing to help our future data scientists is a simple but effective homework assignment.  In this exercise, we want to 1) help our students get comfortable optimizing across different metrics while 2) performing some rudimentary analytics to create a “score” that tells them the best neighborhoods to target for their candy optimization journey.

Figure 5 provides a simple spreadsheet that is designed to help students get comfortable playing with the data and the decision variable weights in order to make an informed decision about what neighborhoods they should target for their “Treat or Treating” venture.

Figure 5: Rudimentary Neighborhood Scoring Algorithm

To calculate the Neighborhood Candy Gathering Optimization Score in the last column of Figure 5, the students need to do the following (indicated in red in Figure 5):

  1. Enter the names of their potential target Neighborhoods.
  2. Next, enter a weight for the relative importance to them of each of the 3 different variables (Variable 1: Amount of Candy, Variable 2: Quality of Candy, and Variable 3: Time to Gather Candy). We use a scale of 1 to 10 where 10 is your most important variable and 1 is your least important variable.

    Note: Not all variables are of equal weight. Part of the data science process is making trade-offs between the weights assigned to the different variables. Because there probably isn’t an equal difference between the importance of the variables, feel free to use the full range of 1 to 10 to make a relative determination of the value of each variable vis-à-vis each other.
  3. Finally, for each neighborhood, enter a weight for how well you think that particular neighborhood does vis-à-vis each variable. For example: for Variable 1 (Amount of Candy), I felt that Mid Town and South Side would yield the highest volume of candy based upon previous experience and recommendations from friends (so both got 8’s out of 10), while I felt that Old Town would probably yield the lowest volume of candy based upon previous experience and recommendations from friends (so I gave Old Town a 2 out of 10).

Allow the students to play with the weights on the Variables and the Neighborhoods to see the impact that each has on the resulting Candy Optimization Score in the final column of the spreadsheet. 
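For students who prefer code to spreadsheets, the same exercise can be sketched in a few lines. This assumes the score is a simple weighted sum of each neighborhood's ratings, which may differ from the exact formula in Figure 5. The Amount of Candy ratings (8, 8, and 2) come from the example above; the variable weights and the other ratings are made-up placeholders to play with.

```python
# Hypothetical importance weights for the three variables (scale of 1 to 10).
weights = {"amount": 8, "quality": 10, "time": 4}

# Ratings per neighborhood (1 to 10, higher is better). Only the "amount"
# ratings come from the article; the rest are placeholders.
ratings = {
    "Mid Town":   {"amount": 8, "quality": 6, "time": 5},
    "South Side": {"amount": 8, "quality": 4, "time": 7},
    "Old Town":   {"amount": 2, "quality": 9, "time": 8},
}


def candy_score(neighborhood_ratings, variable_weights):
    """Weighted sum: each rating multiplied by its variable's weight."""
    return sum(
        variable_weights[var] * neighborhood_ratings[var] for var in variable_weights
    )


scores = {name: candy_score(r, weights) for name, r in ratings.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score}")
# Mid Town: 144
# Old Town: 138
# South Side: 132
```

Changing a single weight, for example raising the importance of time, can reorder the neighborhoods, which is exactly the trade-off the exercise is meant to surface.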

Extra credit: ask them what data they might want to gather in order to help them make even more informed, accurate weighting decisions.

Finally, the spreadsheet from Figure 5 can be pulled off of Google Docs:

Extra, extra credit: What do you get when you divide the circumference of your Jack-o’Lantern by its diameter?

Did you answer, Pumpkin Pi?  Hehehe


Kids are natural data scientists; they have the natural curiosity to leverage data and basic analysis to make more informed decisions. But what are we as parents and teachers doing to nurture that innate, curiosity-driven, explore-test-learn Data Science capability? Help them by introducing them to a structured way to perform basic analysis – using the Hypothesis Development Canvas – and watch their natural curiosity, creativity and innovation cycle kick in.

In closing, I ‘witch’ you a Happy Halloween and have fun “Trick or Treating”, you crazy data scientists you!


Saturday, October 24, 2020

So You Want to Write for Data Science Central

You're a writer on AI, ML, or various and sundry other data-oriented TLAs, and you'd like to write an article for Data Science Central. Great! This article is for you. Becoming a blogger on DSC is a good way to promote your proficiency in the field, to get the word out about interesting topics, or to gain the respect of your peers.

The mechanics of publishing on Data Science Central are straightforward:

  • If you are not already a member of DSC, set up an account on Data Science Central or one of its related accounts (we're working on single sign-on, but we're not quite there yet).
  • Wait for approval (you should receive an email letting you know when we add you to the membership roster).
  • Once you are a member, log in, then select Home/Add Blog Post from the menu at top.
  • You can use the WYSIWYG editor for content, or slip into HTML Editor mode. We hope to add support for Markdown and code display blocks early next year, but for now, you're somewhat limited in working with code.
  • Save your draft with the Save Draft button. When you're ready to publish, select the Publish button.
  • All content is moderated. This means that if the editor decides not to publish your article, they will not publish your article. If they do, then it will be published, typically within 2-3 days. If you have any questions, please contact the editors (including me).
  • Data Science Central currently does not pay for content. On the other hand, the platform has more than half a million subscribers, so it is a fantastic place to post for exposure, and we do our best to promote content that we feel is worthwhile.

While the mechanics are important, it's also worth spending some time trying to understand what DSC is looking for in content:

  • First up: Topics. When DSC was a brand new site, way back in 2012, the term Data Science itself was very novel, and it usually meant people who were able to use a new breed of programming tools (most specifically R, but later Python), to do analytics work, in the wake of the Big Data and Hadoop revolution that was going on at the time. Data Science Central was a cool, pithy name for the site, and as interest in the field grew, so did DSC.
  • Coming up on a decade later, things have changed. Being a data scientist has overtaken programming as the wish list career topper that all aspiring nerds want to be when they grow up. Machine learning algorithms and convolutional neural networks are increasingly replacing traditional programming for a variety of activities, and data is becoming strategic within organizations rather than simply tactical.
  • To that end, what we at DSC are looking for are stories about data. This can include data analysis tools and modeling, neural networks, data storage and access strategies, and knowledge representation. It also includes the strategic uses of data, governance, provenance, quality and protection, visualization and creative data story-telling. We're also expanding into those areas of artificial intelligence that are critical to cognitive computing, knowledge graphs, mathematics, and science. Why? Because data science is as much about science as it is about algorithms. Finally, DSC will focus more on the implications of data transformations on businesses, government, manufacturing, society and the individuals within it.
  • We're looking for journalism. Some examples:
    • "How AI is transforming retail",
    • "Will GPT-3 win the Pulitzer Prize?",
    • "Data scientists and the political realm",
    • "Challenges of contact tracing in a post-COVID world",
    • "Penrose, Tiles and the Nobel Prize".
  • We're looking for in-depth technical articles -
    • "How to digitally transform a company",
    • "AI in the Browser",
    • "Deep Fakes and the Algorithms That Drive Them",
    • "Knowledge Graphs for Publishing".
  • We're looking for case studies
    • "Point of Failure: When AI Goes Rogue",
    • "Wrangling Drones",
    • "What happened to the Self-Driving Car?".
  • Finally, we're looking for thought leadership -
    • "Where Do We Go From Here",
    • "Trolley Ethics",
    • "Who Really Benefits From AI?".
  • We're looking for you to put on your teacher's hat, your prognosticator's hat, your analyst's hat, and tell us about the world that YOU see.

All this being said, it's also worth understanding what we're NOT looking for.

  • We are not looking for marketing pieces. If your product has an interesting toolset and you can dig into how to work with that toolset to solve complex problems, we might consider it, but we're more likely to send our sales-people to you to talk about ways that we can benefit one another (that's beyond this editor's pay grade ... thankfully).
  • We are not yet posting advertisements for jobs (or posting resumes, for that matter). This is not to say we aren't considering doing this, but like everything else, there's complexity in the implementation. If you have questions about either of these, feel free to contact the editors directly, and we can talk. We also have educational promotional products for universities and private institutions.
  • Similarly, if you have events that you want to promote, talk to the editors. With the pandemic, we're awash in virtual seminars and conference notifications, but they do have value to the community.
  • We occasionally do webinars and interviews, though in general this is likely to be something that we will handle directly. If you LIKE to interview others, either via video or digital print, please contact the editors.
  • If you are the author of a book that you'd like to promote in the data science or knowledge engineering space, contact the editors.

Finally, a bit on style - things that will make your editors all tingly with delight, rather than awash in apathy.

  • We prefer original content. If you have a paper on ArXiv, for instance, write up a story that summarizes the importance of that content, in more readable and less academic terms. If you want to repost elsewhere, you can do so, but we generally do not repost existing articles from outside the site unless they are exceptional, and that's usually at the editor's discretion, not the writer's.
  • DSC is NOT a peer-reviewed journal. We welcome code and data samples (especially as we migrate to a new platform) but ultimately your audience is going to likely be technically proficient but not necessarily deep experts. As a rule of thumb - write to a tenth-grade audience, not a post-doctoral one. 
  • We LIKE pictures. Diagrams, illustrations, photos, the whole worth a thousand words thing. However, if you do use pictures, make sure you have the rights to them. Our lawyers get unhappy when we have to speak with the other guy's lawyers. While on the current platform it's a bit awkward to do, we would also like to start including a splash graphic at the top of the article, primarily to generate thumbnails.
  • Also, make sure you upload a good image of yourself when you are making your account. 
  • In general, we prefer articles that run between about 600 and 1800 words.
  • We're looking for professional writing - concise, easy to read, broken up into clear paragraphs.
  • Identify your name, your title, and work or school affiliation.
  • If you can, include a three to six bullet point list (called Data Points) covering the highlights or takeaways for the article itself. If these correspond with section headers, even better, but try to provide to your reader something to make them want to read your article.
  • In general, DSC editors prefer fewer (or no) links to outside references, especially if they are promotional in nature. We also reserve the right to link to definition content on other TechTarget properties. If you do include links, try to make sure that they open in a separate pane (should be easier to do shortly), and in general such links should only appear at the bottom of the article, rather than inline (think footnotes)
  • Include a short bio at the bottom of your article. You can link to a personal website or LinkedIn page in the bio.
  • Articles that are in draft form for longer than three months will be deleted.
  • If you wish for an editor to review your article and give you feedback, add an [Editor] tag at the bottom of the article with your questions. This will be deleted once the article is ready for publication.
  • DSC makes no guarantees that it will publish an article once it has been submitted, though we will attempt to get back to you as quickly as possible about whether the article was accepted or not. 

Have fun, be creative, and take a chance. Welcome to Data Science Central.

Kurt Cagle
Community Editor
Data Science Central

from Featured Blog Posts - Data Science Central
via Gabe's Musings

Sunday, October 4, 2020

On the Nature of Data, Flights of Birds, and New Beginnings

My name is Kurt Cagle. I am the new Community Editor for Data Science Central, or DSC as it is known by its fans.

I'm one of those fans. Twelve years ago, Vincent Granville and Tim Matteson created a new site devoted to a passion they had: Data Science. In 2012, the term data science, and the practitioners, data scientists, were just beginning to come into vogue, specifically referring to the growing importance of a role that had been around for some time, the erstwhile data analyst, with the idea being that this particular role was different from a traditional programmer's role, though it borrowed many of the same tools.

Traditionally, an analyst, any analyst, has been someone who looks at information within a specific subject domain and, from their analysis, can both identify why things are the way they are and to a certain extent predict where those same things will be in the future.

Analysts have been around for a long time, and have always had something of a mystical air to them. As an example, in early Imperial Rome, there were a number of celebrated priests called Augurs who were said to be able to predict the future from the flight of birds. They had a surprisingly high success rate, and were usually in great demand by both military leaders needing strategic advice and merchants looking to better deploy their fleets and land agents. 

At first, the correlation between bird flight patterns and sound trade policy advice would seem low at best, but as with any good magical trick, it was worth understanding what was going on in the background. Why does one watch the sky for bird flight? Easy. Certain types of birds, such as homing pigeons, can carry messages from ships or caravans to various outposts, and from there such information can be relayed via both birds and other humans to central gathering points.  In other words, the Augurs had managed to build a very sophisticated, reasonably fast intelligence network tracking ships, troop movements, plague spots, and so forth, all under the cover of watching the skies for birds. Even today, the verb to augur means to predict, as a consequence.

In the eight years since Data Science Central published its first post, the field has grown up. Statistical and stochastic functions have become considerably more sophisticated. The battle royale between R and Python has largely been resolved as "it doesn't matter", as statistical toolsets make their way to environments as diverse as Scala, JavaScript, and C#. The lone data scientist has become a team, with specialties as diverse as data visualization, neural network training, and data storytelling staking their claims to the verdant soil of data analysis.

What is even more exciting is that this reinvention is moving beyond the "quants" into all realms of business, research, and manufacturing organizations. Marketing, long considered to be the least "mathematical" of disciplines within business, now requires at least a good grounding in statistics and probability, and increasingly consumes the lion's share of a company's analytics budget. Neural nets and reinforcement learning are now topics of discussion in the board room, representing a situation where heuristic or algorithmic tools are being supplemented or even replaced with models with millions or even billions of dimensions. The data scientist is at the heart of organizational digital transformations.

Let me bring this back to DSC, and give you, gentle reader, a brief bio of me, and what I hope to be able to bring to Data Science Central. I have been a consulting programmer, information architect, and technological evangelist for more than thirty years. In that time I have written twenty-some-odd books, mainly those big technical door stoppers that look really good on bookshelves. I've also been blogging since 2003 in one forum or another, including O'Reilly Media, Jupiter Publishing, and Forbes. I spent a considerable amount of time trying to push a number of information standards while working with the W3C, and have, since the mid-2000s, focused a lot of time and energy on data representation, metadata, semantics, data modeling, and graph technology.

I'm not a data scientist. I do have a bachelor's degree in astrophysics, and much of a master's degree in systems theory. What that means is that I was playing with almost all of the foundational blocks of modern data science back about the time when the cutting edge processors were the Zilog Z80 and the 6502 chips within Apple II+ systems. I am, to put it bluntly, an old fart.

Yet when the opportunity to take over DSC came up, I jumped at it, for a very simple reason: context. You see, it's been my contention for a while that we are entering the era of Contextual Computing, eventually to be followed by Metaphorical Computing (in about twelve years, give or take a few). Chances are, you haven't heard the term Contextual Computing bandied about very much. It's not on Gartner's hype cycle, because it's really not a "technology" per se. Instead, you can think of contextual computing as the processing of, and acting on, information that takes place when systems have a contextual understanding of the world around them.

There are several pieces to contextual computing. Data Science is a big one. So is Graph Computing. Machine Learning, AI, the Internet of Things, the Digital Workplace, Data Fabric, Autonomous Drones, the list is long and getting longer all the time. These are all contextual - who are you, where are you, why are you here, what are you doing, why does it matter?

Data Science Central has become an authority in the world. My hope, my plan at this point, is to expand its focus moving into the third decade of this century. I'm asking you as readers, as writers, as community members, to join me on this journey, to help shape the nature of contextual computing. DSC is a forum for sharing technology, but also for asking deep questions about ethics, purpose, and the greater good, with an eye towards opportunities. I hope to take Vincent and Tim's great community and build it out, with your help, observations, and occasional challenges.

Free book - Artificial Intelligence: Foundations of Computational Agents

There are many excellent free books on Python, but Artificial Intelligence: Foundations of Computational Agents is about a subject not commonly covered.

I found the book useful as an introduction to Reinforcement Learning.

As the title suggests, the book is about computational agents.

An agent observes the world and carries out actions in the environment. The agent maintains an internal state that it updates. The environment, in turn, takes in the agent's actions, updates its own internal state, and returns percepts. In this implementation, the state of the agent and the state of the environment are represented using standard Python variables, which are updated as the state changes.

This structure can be used to model many interesting problems and is the focus of the book. Ultimately, it leads to Reinforcement Learning.
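The agent and environment loop described above can be sketched in a few lines of Python. This is only an illustration, not the book's own code: the thermostat scenario and the names `Environment`, `ThermostatAgent`, `select_action`, and `run` are all hypothetical.

```python
class Environment:
    """A toy room that cools by 1 degree per step unless the heater is on."""

    def __init__(self, temperature=15):
        self.temperature = temperature  # the environment's internal state

    def initial_percepts(self):
        return {"temperature": self.temperature}

    def do(self, action):
        """Take in the agent's action, update state, and return new percepts."""
        if action == "heat":
            self.temperature += 2
        else:
            self.temperature -= 1
        return {"temperature": self.temperature}


class ThermostatAgent:
    """Heats whenever the observed temperature is below the target."""

    def __init__(self, target=20):
        self.target = target
        self.history = []  # the agent's internal state: actions chosen so far

    def select_action(self, percepts):
        action = "heat" if percepts["temperature"] < self.target else "idle"
        self.history.append(action)
        return action


def run(agent, env, steps):
    """Alternate between the agent acting and the environment responding."""
    percepts = env.initial_percepts()
    for _ in range(steps):
        action = agent.select_action(percepts)
        percepts = env.do(action)
    return percepts


env = Environment(temperature=15)
agent = ThermostatAgent(target=20)
final = run(agent, env, steps=5)
print(final, agent.history)
```

Both sides keep their state in ordinary instance variables, which is the point of the structure: swapping in a smarter `select_action` changes the agent without touching the environment.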

The book structure is

Chapter 3: Searching for Solutions

Chapter 4: Reasoning with Constraints

Chapter 5: Propositions and Inference

Chapter 6: Planning with Certainty

Chapter 7: Supervised Machine Learning

Chapter 8: Reasoning Under Uncertainty

Chapter 9: Planning with Uncertainty

Chapter 10: Learning with Uncertainty

Chapter 11: Multiagent Systems

Chapter 12: Reinforcement Learning

Chapter 13: Relational Learning

The authors' website also has detailed slides.

I found the work exceptional, so I bought the book Artificial Intelligence: Foundations of Computational Agents.

But there is a free version.

You can download the book, code, and other resources at Artificial Intelligence: Foundations of Computational Agents.

Data Detectives

It has become evident that developments in analytics are creating new occupations. There has been much discussion about where new jobs will come from with many existing ones being made redundant because of the 4th Industrial Revolution – i.e. the impact of artificial intelligence and robotics. Analytics is bucking this trend.


Some of the new occupations in analytics include data prospectors and data harvesters. Data prospectors, like gold prospectors, are responsible for searching and locating data on the internet and other large data repositories. Data harvesters are responsible for extracting data and information from these sources. Data harvesters do this, for example, by web scraping. Staff who are highly skilled and knowledgeable in doing these functions are required, especially when exploring something as vast and intricate as the internet.


Another new occupation is that of a data detective. They are analysts who find knowledge and insights in data. This may sound like a simple and straightforward job.


It is suggested that there are plenty of analysts who can do extraction and cleaning tasks but have little or no aptitude for exploring data to find answers to difficult problems and issues and struggle to recognise important and informative discoveries. That is, they can perform the technical tasks of providing data but are not able to use it to find ‘nuggets of gold’ in this resource.


What is required are highly skilled professionals who, like police detectives, excel at analysis and problem solving. They need to be proficient in marshalling facts, following leads in data, testing hypotheses and hunches, joining the ‘dots’ and drawing conclusions from what is known. In short, they require the knowledge and skills of a Sherlock Holmes.


The primary skills required by data detectives are the ability to explore data and the ability to identify items of interest. They can do this by using the functionality of desktop packages such as Microsoft Excel and Microsoft Access and data visualization packages such as Tableau, QlikView and Power BI. They can also interrogate data using SQL with structured data and SPARQL with semantic data.


Where data detectives add value is that they ask informed questions to help to understand challenging and difficult problems and issues. They find workarounds when they hit difficulties and obstacles in obtaining the answers they require. They possess the nous, have the patience, and have the persistence to go the extra kilometre to find interesting patterns and trends in data.  


Here are three examples of where data detectives can add value. The first is using risk-analysis tools to gain insights into threats and opportunities. They can take different data views of subjects and issues and, where interesting patterns are found, make further inquiries to find out more about what is going on and what the implications are for developments that can either harm or benefit individuals, organizations and the community.


The second example is stratifying a population to find interesting strata such as those with high incidence of a disease such as COVID-19. They can analyse cases in different strata to see why they have high infection rates and compare these with strata with low infection rates. These analyses can reveal what measures can be taken to lower the incidence of the disease.
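As a rough sketch of this kind of stratification, the snippet below groups records by stratum and compares infection rates. The age bands and the records are made up purely for illustration, and `incidence_by_stratum` is a hypothetical helper rather than part of any particular tool.

```python
from collections import defaultdict

# Hypothetical records: (age band, infected?). Invented for illustration only.
cases = [
    ("0-29", False), ("0-29", False), ("0-29", True), ("0-29", False),
    ("30-59", True), ("30-59", False), ("30-59", True), ("30-59", False),
    ("60+", True), ("60+", True), ("60+", True), ("60+", False),
]


def incidence_by_stratum(records):
    """Return the share of infected cases within each stratum."""
    totals = defaultdict(int)
    infected = defaultdict(int)
    for stratum, is_infected in records:
        totals[stratum] += 1
        infected[stratum] += int(is_infected)
    return {stratum: infected[stratum] / totals[stratum] for stratum in totals}


rates = incidence_by_stratum(cases)
highest = max(rates, key=rates.get)  # stratum to investigate first
lowest = min(rates, key=rates.get)   # stratum to use as a comparison
print(rates, highest, lowest)
```

The calculation is the easy part. The detective work starts afterwards, in working out why the highest and lowest strata differ.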


The third example is analysing cases that have anomalies with insurance claims. Business rules can be written for those who show unusual patterns and the rules can be cascaded to find other people in the population who closely match them as they too may have issues with their claims.
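One way to picture such business rules is as small predicates that are applied across the whole population. The fields, thresholds, and claimants below are hypothetical, chosen only to show the shape of the approach.

```python
# Hypothetical claim summaries. Field names and values are illustrative only.
claims = [
    {"claimant": "A", "claims_per_year": 1, "avg_amount": 400},
    {"claimant": "B", "claims_per_year": 7, "avg_amount": 2900},
    {"claimant": "C", "claims_per_year": 2, "avg_amount": 650},
    {"claimant": "D", "claims_per_year": 6, "avg_amount": 3100},
]


def frequent_claimant(record):
    """Rule derived from a known anomalous case: unusually many claims."""
    return record["claims_per_year"] >= 5


def high_value(record):
    """Rule derived from the same case: unusually large average claim."""
    return record["avg_amount"] >= 2500


RULES = [frequent_claimant, high_value]


def flag(records, rules):
    """Return claimants who match every rule, i.e. who resemble the known case."""
    return [r["claimant"] for r in records if all(rule(r) for rule in rules)]


flagged = flag(claims, RULES)
print(flagged)
```

Keeping each rule as its own function makes it easy to add, drop, or loosen a rule as the detective learns more about the pattern.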


It is suggested that data detective work needs to be recognized as a specialist skill where those with requisite attributes are selected, trained, and employed to do this work. Organizations need to take steps to identify those who are gifted in doing detective tasks and use their talents.


They complement data scientists who use mining and modelling techniques to extract knowledge from data. Data detectives are more qualitative in their approach while data scientists are more quantitative in their orientation. However, data detectives use the tools and procedures developed by data scientists to explore data such as using population partitioning techniques.


Data detectives can go the extra step of interpreting what data scientists find in data and can give context to what is discovered or detected.  For example, data scientists can produce a list of high-risk cases detected using a machine-learning model but often they cannot explain why they are classified in this manner. Data detectives can explore data to give context to the cases and explain why they were identified by the model. They can also spot false positives or cases that appear to be of concern but are false alarms and therefore do not warrant attention. This saves time, money and effort in that resources are not wasted pursuing them.   


Data detectives are part of the broader and growing family of occupations that deal with data. This family includes, as examples, data prospectors, data harvesters, data scientists, data analysts, data engineers, data architects, data brokers, data lawyers, data journalists, data artists, data quality officers and database managers. They each have a discrete and important role to perform and they all complement each other in making use of what is now referred to as the new oil. Data is now the fuel that enables organizations to function and to deliver business outcomes.


When it comes to formal education, there are now many master's programs in analytics in universities across the globe. These programs could be expanded to include different specialization streams to cater for the data occupations cited above. That is, they become omnibus programs where students can select relevant subjects that enable them to specialize in data science, data engineering, data brokering, or data detective work, to name a few examples. These specializations are required to provide practicing analytics professionals to meet the diverse needs of government, industry, and commerce in the 21st Century.




Warwick Graco is a practicing data scientist and can be contacted at

Introducing Analytics To A Product

With more and more people becoming conversant in analytics, demand for it is growing more pronounced in every field. Product is no different. Introducing analytics, in some form or other, into the products you own is almost inevitable. However, before you open the gates to this world of magic, there are three questions you should try answering.

These three basic questions will help you plan your analytics strategy and will act as a compass in times of uncertainty.


Why analytics:

It is great that you have decided to embark on the journey, even if it is only out of fear of missing the bandwagon. Nevertheless, without answering this question, your team will always be involved in directionless busy work. The question comes down to finding out why you want to introduce analytics to the product. As McKinsey puts it, without the right question, the outcome would be marginally interesting but monetarily insignificant.

  1. Is it to enable end users of your product?
  2. Will it serve your internal product intelligence?
  3. Is it because every other product has some flavor of analytics?
  4. Are investors asking for it?
  5. Is it the next big strategy for the product roadmap?
  6. Have you hired a data science team that you do not know what to do with?
  7. Would it provide a better selling proposition for the sales team?
  8. Are your clients asking for their usage statistics?

What in analytics:

It seems obvious to ask yourself what you want to build before you actually start building. The same is true for analytics.

  1. Do you want to enable reporting of various metrics for admins of your B2B platform?
  2. Would you leverage AI/ML for a product feature for end users?
  3. Are you looking for more in-depth product intelligence?
  4. Is it a nice-to-have feature without much utility, or is it going to be the prime feature offering?


The answer to the above questions is a function of the product type, its intended use, and its users.

How to analytics:

After addressing Why & What to offer in analytics, the logical next step is to plan how to deliver it. Moreover, when we are talking about analytics, data is the centerpiece. Data creates the need for a completely new ecosystem of processes and practices to meet the regulatory, trust and demand obligations.

Although every constituent of data management calls for a dedicated article, I shall touch briefly on each and try to illustrate how each influences the analytics strategy of the product.

Database systems: The primary infrastructure, and the cornerstone of the analytics strategy, is the database that stores the data. The choice is between an RDBMS, NoSQL, or a hybrid solution, and then among dozens of vendors.

Master data and metadata management: This is the definition, the identity, the identifier, the reference via which every data call will be directed. It is essential to know and govern extensive data assets.

Quality control: You must have heard of the saying ‘garbage in, garbage out’. Bad data will severely hamper the trust and actionable knowledge in business operations. Data has to be unique, complete and consistent.
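The three properties named above (unique, complete, consistent) can each be turned into a simple check. The following is a minimal sketch on made-up rows. The field names, the ISO date convention, and the helper functions are assumptions for illustration, not a prescribed toolset.

```python
from collections import Counter
from datetime import datetime

# Hypothetical user rows with one problem of each kind planted in them.
rows = [
    {"id": 1, "email": "a@example.com", "signup": "2020-01-05"},
    {"id": 2, "email": None, "signup": "2020-02-11"},
    {"id": 2, "email": "c@example.com", "signup": "2020-03-02"},
    {"id": 4, "email": "d@example.com", "signup": "03/15/2020"},
]


def duplicate_ids(rows):
    """Uniqueness: ids that appear more than once."""
    counts = Counter(r["id"] for r in rows)
    return sorted(key for key, n in counts.items() if n > 1)


def incomplete_rows(rows, required):
    """Completeness: ids of rows missing any required field."""
    return [r["id"] for r in rows
            if any(r.get(field) in (None, "") for field in required)]


def inconsistent_dates(rows, field, fmt="%Y-%m-%d"):
    """Consistency: ids of rows whose date does not follow the agreed format."""
    bad = []
    for r in rows:
        try:
            datetime.strptime(r[field], fmt)
        except (TypeError, ValueError):
            bad.append(r["id"])
    return bad


print(duplicate_ids(rows))
print(incomplete_rows(rows, ["email", "signup"]))
print(inconsistent_dates(rows, "signup"))
```

Checks like these are cheap to run at ingestion time, before bad rows reach reports and dashboards.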

Integration definition: For analytics to be practical and actionable, data has to flow in from varied sources. This can be a transfer between different products or a join between multiple modules within the platform. A schema is a map or viaduct that enables this unification.

Warehouse: The transactional or raw data stored by the platform might not be ideally designed for analytics. Joining a dozen tables on the fly would impact not just the throughput but also the very feasibility of insight generation. A purpose-built data warehouse is an efficient step towards integrating data from multiple heterogeneous sources. However, this may lead to a near-real-time system with some delay in data availability.

Transformation: Data transformation is an integral part of data integration or data warehousing, where the data is converted from one format or structure to another. It involves numeric and date calculations, string manipulation, or rule-based sequential data wrangling processes. As a step in ETL (extract, transform, load), data transformation cuts down the processing time for the end user, thus enabling swift reporting and insight generation.
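To make the three kinds of transformation concrete, here is a small sketch that applies a string cleanup, a date calculation, and a numeric conversion to raw order rows. The rows, field names, and the `transform` function are hypothetical.

```python
from datetime import date
from decimal import Decimal

# Hypothetical raw rows as they might arrive from a transactional system.
raw_orders = [
    {"customer": "  alice SMITH ", "ordered": "2020-10-01",
     "shipped": "2020-10-04", "amount": "19.99"},
    {"customer": "Bob jones", "ordered": "2020-10-02",
     "shipped": "2020-10-03", "amount": "5"},
]


def transform(row):
    """Convert one raw row into an analytics-friendly shape."""
    ordered = date.fromisoformat(row["ordered"])
    shipped = date.fromisoformat(row["shipped"])
    return {
        # String manipulation: trim whitespace and normalize capitalization.
        "customer": row["customer"].strip().title(),
        # Date calculation: days between order and shipment.
        "days_to_ship": (shipped - ordered).days,
        # Numeric calculation: store money as integer cents.
        "amount_cents": int(Decimal(row["amount"]) * 100),
    }


clean_orders = [transform(row) for row in raw_orders]
print(clean_orders)
```

Doing this work once, in the load step, means every downstream report reads the precomputed `days_to_ship` instead of recalculating it per query.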

Governance: Sets the guiding principles, benchmarks, practices and rules for 1) Data policies 2) Data quality 3) Business policies 4) Risk management 5) Regulatory compliance 6) Business process management. Because governance is an essential part of RFPs and government regulations, a lack of it can expose a company to lawsuits, higher data and process costs, and even complete business failure.

Architecture: According to the Data Management Body of Knowledge (DMBOK), Data Architecture “includes specifications used to describe existing state, define data requirements, guide data integration, and control data assets as put forth in a data strategy.” Simply put, data architecture describes how data is collected, stored, transformed, distributed and consumed. Data architecture bridges business strategy and technical execution.


Without the power to derive information and insights from it, storing data is of no use. Once the data management is in place, planning is required for processing and representing the data.

Collaboration vs in-house development: There are tons and tons of tools available in the market that help make sense of your data. These can be traditional BI tools like Power BI/ Tableau/ Qlik/ MicroStrategy that help make dashboards. Or, there are modern BI tools like Looker/ Periscope/ Chartio which go beyond just dashboarding. Then there are tools like Amplitude/ Firebase/ Google Analytics/ Mixpanel/ MoEngage which help with product analytics and understanding user behavior. These tools easily integrate with your product and provide faster go-to-market for your analytics offering. However, there is a cost associated with this: 1) steep recurring subscription cost 2) less control over features. An alternative could be developing the reporting and dashboarding tool in-house. It does come with a very long gestation/development period and the operational issues of managing larger teams. However, these can be offset by the cost savings and superior control.

Real time or delayed: With business needs driving this decision, comparison could be made between consuming transactional data (real time analytics) or warehouse data (near real time) for analytical purposes. A warehouse definitely has an edge, providing more flexibility and scope but real time reporting has its own charm.

AI/ML: Artificial intelligence and machine learning are the latest buzzwords, with something as humble as macro automation being classified as AI/ML. However, genuine AI/ML solutions give the product a differentiated offering along with essential added value for the end users. The main concern with AI/ML implementation is cost. Whether human talent or infrastructure, it does not come cheap. Not to forget the much-needed patience and trust that leaders need to invest. Hence, the decision has to be a well-thought-through business decision, rather than a hasty push from the IT department.


Essentially, a well-thought-out plan that considers demand, capabilities, resources, the company's management, the business, and legal and regulatory requirements, rather than a knee-jerk reaction, gives the analytics implementation its best chance of success.
