An Introduction to Data Science

By Vidyaratnam Ganapathy



A graph comparing quarterly profits

What is Data Science?

Data science is an interdisciplinary field that uses techniques from mathematics, statistics, and computer science to collect data and extract information from it. This information is then used to form inferences and make predictions, which are applied in a wide range of domains. Jim Gray, a Turing Award-winning computer scientist, described data-driven study as the fourth paradigm of science, after the empirical, theoretical, and computational paradigms.



A time series data graph - Hema Murthy

Data Science and Machine Learning

Data science is closely tied to another discipline, machine learning. In machine learning, algorithms are built from pre-existing data, known as “training data”. The algorithm observes the inputs and outcomes recorded for each data point and learns the relationships between them. As more data becomes available, these algorithms learn from experience and improve their predictions. Machine learning and data science differ in one key aspect: their principal objective. Data science draws inferences from a sample that can be applied to the population from which the sample was drawn, while the main aim of machine learning is to find generalized patterns that can be used to predict future outcomes.
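As a minimal sketch of learning from training data, the example below uses a nearest-neighbour rule: a new case receives the label of the most similar training example. The technique, the function name, and the study-hours data are illustrative assumptions and do not come from any particular system.

```python
import math

def nearest_neighbour_predict(training_data, query):
    """Return the label of the training example closest to `query`."""
    _, closest_label = min(
        training_data,
        key=lambda example: math.dist(example[0], query),
    )
    return closest_label

# Hypothetical training data: (hours studied, hours slept) -> exam outcome
training_data = [
    ((1.0, 4.0), "fail"),
    ((2.0, 5.0), "fail"),
    ((6.0, 7.0), "pass"),
    ((8.0, 8.0), "pass"),
]

prediction = nearest_neighbour_predict(training_data, (7.0, 6.5))
print(prediction)  # pass
```

Adding more labelled examples to training_data changes which neighbour is closest, which is the simplest sense in which such an algorithm improves with more data.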

Statistics, Machine Learning and Artificial Intelligence

A branch of machine learning has expanded beyond finding generalized patterns, towards the study of intelligent agents. Intelligent agents are systems that perceive their surroundings and autonomously take actions to maximize their chances of achieving a goal. Artificial intelligence (AI) aims to replicate human intelligence (also called natural intelligence) in machines. One approach to creating AI is statistical learning, which uses statistical models such as hidden Markov models and Bayesian decision theory as the basis of the AI, in place of the earlier symbol-manipulation approach. These statistical AI systems are used to solve specific problems.
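To give a flavour of the Bayesian approach, the sketch below applies Bayes' rule to update a belief after seeing evidence and then picks the more probable option. The spam-filter scenario and all of the probabilities are made-up assumptions for illustration.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' rule: probability of a hypothesis after observing the evidence."""
    evidence = prior * likelihood_if_true + (1 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence

# Hypothetical numbers: 20% of messages are spam; the word "prize" appears
# in 60% of spam messages and in 5% of legitimate messages.
p_spam = posterior(prior=0.2, likelihood_if_true=0.6, likelihood_if_false=0.05)

# Decision rule assumed here: choose the more probable class.
decision = "spam" if p_spam > 0.5 else "not spam"
print(round(p_spam, 2), decision)  # 0.75 spam
```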

Applications of Data Science

Data science has applications in various domains, and is often used to optimize processes and increase efficiency.

  • In Sports

Data science has a long and rich history in sports. Baseball offers perhaps the most famous example of data science and statistics being used to create ground-breaking philosophies. In 1977, the baseball statistician Bill James published the first edition of his book, The Bill James Baseball Abstract, which served as a starting point for the analytics movement in baseball. In the following decades, Billy Beane, the general manager of the Oakland A’s, popularized the use of analytics to gain a competitive advantage. Beane was so successful that his work became the subject of a movie, Moneyball.

  • In Healthcare

Healthcare has seen some of the most intensive applications of data science, which is used to track the spread of diseases, project the metastasis of cancerous tumours, and predict menstrual cycles. Google has developed a tool to identify and predict the spread of breast cancer tumours to the lymph nodes and other areas of the body. The tool, named LYNA (Lymph Node Assistant), uses machine learning: it catalogues the outcomes of previous cases and uses them to predict the results of future cases.

Clue, an app founded in Germany in 2013, tracks women’s menstrual cycles and fertility windows. The app collects data, and algorithms built with tools such as Python and the interactive notebook environment Jupyter Notebook predict the menstrual cycle.

  • In Security

2D facial recognition software, used by governments and for phone security, is another application of data science. Faces are stored as a set of nodal points, which include the distance between the eyes, the width of the nose, the depth of the eye sockets, the shape of the cheekbones, and the length of the jaw line. These points are measured and stored as numeric values, producing a numerical code called a faceprint. The faceprints of different faces are then compared with each other to recognize people.

Another method of facial recognition is 3D facial recognition. The software captures a three-dimensional image of a person and, much like 2D facial recognition, uses their distinctive features (the areas where rigid tissue and bone are most apparent) to store a representation of the person. 3D facial recognition software uses more nodal points than 2D software, which provides higher accuracy in recognition.
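The comparison of faceprints can be sketched as a distance calculation between lists of numbers. The example below is a simplified illustration, not the method used by any real product: the measurements, the names, the Euclidean distance, and the threshold are all assumptions.

```python
import math

def faceprint_distance(a, b):
    """Euclidean distance between two faceprints of equal length."""
    if len(a) != len(b):
        raise ValueError("faceprints must have the same number of measurements")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(probe, enrolled, threshold):
    """Return the closest enrolled name, or None if no faceprint is within threshold."""
    best_name, best_distance = None, float("inf")
    for name, faceprint in enrolled.items():
        distance = faceprint_distance(probe, faceprint)
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name if best_distance <= threshold else None

# Hypothetical measurements in millimetres:
# (eye distance, nose width, eye-socket depth, jaw-line length)
enrolled = {
    "alice": (62.0, 34.0, 21.0, 118.0),
    "bob": (68.0, 38.0, 24.0, 131.0),
}

match = identify((62.5, 34.2, 20.8, 118.4), enrolled, threshold=2.0)
stranger = identify((75.0, 30.0, 18.0, 105.0), enrolled, threshold=2.0)
print(match, stranger)  # alice None
```

The threshold decides how different two faceprints may be while still counting as the same person. A larger value accepts more variation but risks false matches.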

Methods for Data Analysis

There are many types of data analysis, which are used in different cases and with different forms of data.

  • Cluster Analysis

Cluster analysis is the process of grouping data elements together by similarity. Because there is no target variable in cluster analysis, the method is used to find hidden patterns and to provide context for trends in a dataset. It is often applied to find patterns within a particular demographic, usually to optimize customer experience. Cluster analysis is also used in medicine to recognize and distinguish between specific types of tissue: imaging scans are used to map tissue, and cluster analysis is used to identify similar tissue groupings.
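One common clustering algorithm is k-means, chosen here purely as an illustration. The sketch below groups hypothetical customers by age and monthly spend. The data and the starting centroids are assumptions, and a real analysis would normally use a library rather than hand-written code.

```python
import math

def k_means(points, centroids, iterations=10):
    """Assign points to the nearest centroid, then move each centroid to its cluster mean."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for centroid, cluster in zip(centroids, clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                new_centroids.append(centroid)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # the centroids have stopped moving
        centroids = new_centroids
    return centroids, clusters

# Hypothetical customers: (age, monthly spend)
customers = [(22, 30), (25, 35), (24, 28), (51, 90), (55, 95), (53, 88)]
centroids, clusters = k_means(customers, centroids=[(22, 30), (55, 95)])
print(clusters[0])  # [(22, 30), (25, 35), (24, 28)]
print(clusters[1])  # [(51, 90), (55, 95), (53, 88)]
```

No labels are supplied anywhere in the input. The two groups emerge only from how close the points are to one another.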

  • Regression Analysis

Regression analysis is used to find the impact of one or more independent variables on a dependent variable. Analysis with one independent variable is called simple regression (simple linear regression when the relationship is modelled as a straight line), while analysis with multiple independent variables is called multiple regression. In regression analysis, past relationships between variables are analysed and used to predict future developments. This method of analysis is used to detect relationships between factors and find areas for improvement.
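For the case of a single independent variable, a straight line can be fitted with ordinary least squares. The sketch below shows the calculation on made-up advertising data. The variable names and numbers are assumptions chosen so that the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical past data: advertising spend (independent) and units sold (dependent).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [12.0, 14.0, 16.0, 18.0, 20.0]  # follows units = 2 * spend + 10 exactly

slope, intercept = fit_line(ad_spend, units_sold)
forecast = slope * 6.0 + intercept  # predict sales for a spend not yet observed
print(slope, intercept, forecast)  # 2.0 10.0 22.0
```

The slope estimates how much the dependent variable changes for each unit of the independent variable, which is the "impact" that regression analysis sets out to measure.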

  • Data Mining

In the data mining process, large amounts of data are analysed to generate knowledge. Exploratory statistical methods are used to identify trends, dependencies, and correlations in the data and to draw inferences from them. Data mining is often used in machine learning and artificial intelligence, where the initial patterns it establishes provide the base on which an intelligent system grows.
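One simple way to surface dependencies in a large dataset is to count which items tend to occur together. The sketch below does this for a handful of hypothetical shopping baskets. The data, the function name, and the support threshold are illustrative assumptions, and real data mining tools use more scalable algorithms.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least `min_support` baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: count for pair, count in pair_counts.items()
            if count >= min_support}

# Hypothetical purchase records
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]

pairs = frequent_pairs(baskets, min_support=3)
print(pairs)  # {('bread', 'butter'): 3}
```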

  • Neural Networks

Neural networks are the basis for much of the growth of artificial intelligence. They are loosely modelled on the neurons in a human brain and process data with minimal human intervention. Similar to human brains, neural networks adjust the strength of their connections with experience. They are generally used in predictive analytics and other AI models.
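The idea of learning from experience can be illustrated with a single artificial neuron, known as a perceptron. This is far simpler than the networks used in practice. The task (learning the logical AND of two inputs), the learning rate, and the number of passes over the data are assumptions made for this sketch.

```python
def step(value):
    """Activation function: the neuron fires (1) when its input is non-negative."""
    return 1 if value >= 0 else 0

def predict(weights, bias, inputs):
    """Weighted sum of the inputs plus a bias, passed through the step function."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_perceptron(examples, learning_rate=1.0, epochs=20):
    """Nudge the weights after every mistake (the perceptron learning rule)."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Training data for logical AND: the output is 1 only when both inputs are 1.
and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(and_examples)
predictions = [predict(weights, bias, inputs) for inputs, _ in and_examples]
print(predictions)  # [0, 0, 0, 1]
```

Nothing in the code states the rule for AND. The weights start at zero and are corrected only when the neuron gives a wrong answer, until every example is classified correctly.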

Data science is one of the most interconnected subjects, with roots in many areas of study and a wide range of domain-specific applications. It is one of the fastest-growing fields and the way of the future.


