Data Science Interview Questions
- What is marginal probability?
- What are the probability axioms?
- What is conditional probability?
- What is Bayes’ Theorem and when is it used in data science?
- Define variance and conditional variance.
- Explain the concepts of mean, median, mode, and standard deviation.
- What is SQL, and what does it stand for?
- What is the ER model in SQL?
- What is data transformation?
- What are the main components of a SQL query?
- What is a primary key?
- How do you handle missing or NULL values in a database table?
- Describe the Bernoulli distribution.
- Explain the exponential distribution and where it’s commonly used.
- What is the significance level (alpha) in hypothesis testing?
- Define different types of SQL functions.
Q1)What is marginal probability?
Marginal probability is the probability of an event defined on one variable of interest, irrespective of the values taken by any other variables. It is obtained from the joint distribution by summing (for discrete variables) or integrating (for continuous variables) over those other variables; the collection of marginal probabilities for a variable forms its marginal distribution. The name comes from the practice of writing these row and column totals in the margins of a joint probability table.
Marginal probabilities are essential in many statistical analyses, including computing expected values, deriving conditional probabilities, and drawing conclusions about a variable of interest after averaging out the other variables.
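As a minimal sketch of marginalisation, the snippet below sums a small joint table over one variable at a time. The weather/commute table and the `marginal` helper are hypothetical and exist only for illustration.

```python
# Hypothetical joint distribution P(weather, commute); the numbers are made up.
joint = {
    ("sunny", "on_time"): 0.45,
    ("sunny", "late"): 0.15,
    ("rainy", "on_time"): 0.20,
    ("rainy", "late"): 0.20,
}

def marginal(joint, axis):
    """Sum the joint probabilities over every variable except the one at `axis`."""
    totals = {}
    for outcome, p in joint.items():
        key = outcome[axis]
        totals[key] = totals.get(key, 0.0) + p
    return totals

weather = marginal(joint, 0)   # P(weather), summing over commute
commute = marginal(joint, 1)   # P(commute), summing over weather
print(weather)
print(commute)
```

Each marginal distribution sums to 1, just as the joint table does.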
Q2)What are the probability axioms?
The probability axioms, also called the Kolmogorov axioms, are the fundamental rules that every assignment of probabilities must satisfy in probability theory and statistics.
There are three fundamental axioms of probability:
Non-Negativity Axiom
Normalization Axiom
Additivity Axiom
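For reference, the three axioms can be written formally as follows. The notation is an assumption of this sketch: \Omega denotes the sample space and A, A_1, A_2, ... denote events.

```latex
\begin{align}
&\text{Non-negativity:} && P(A) \ge 0 \quad \text{for every event } A \subseteq \Omega \\
&\text{Normalization:}  && P(\Omega) = 1 \\
&\text{Additivity:}     && P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i)
                           \quad \text{for pairwise disjoint events } A_1, A_2, \dots
\end{align}
```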
Q3)What is conditional probability?
Conditional probability is the probability that an event A occurs given that another event B has already occurred, written P(A | B). It is calculated by dividing the probability that both events occur by the probability of the conditioning event: P(A | B) = P(A and B) / P(B), provided P(B) > 0. Equivalently, P(A and B) = P(B) × P(A | B).
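A small worked example may help. The sketch below enumerates two fair dice and computes the probability that the sum is 8 given that the first die is even; the choice of events and the helper names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate over outcomes."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def is_sum_eight(o):     # event A
    return o[0] + o[1] == 8

def first_is_even(o):    # event B
    return o[0] % 2 == 0

p_a = prob(is_sum_eight)
p_b = prob(first_is_even)
p_a_and_b = prob(lambda o: is_sum_eight(o) and first_is_even(o))

# P(A | B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a, p_a_given_b)
```

Here conditioning on B raises the probability of A from 5/36 to 1/6.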
Q4)What is Bayes’ Theorem and when is it used in data science?
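Bayes’ Theorem gives the probability of a hypothesis or cause A given observed evidence B by reversing a known conditional probability: P(A | B) = P(B | A) × P(A) / P(B). It combines a prior belief P(A) with the likelihood of the evidence P(B | A) to produce an updated (posterior) probability. For this reason it is sometimes called the formula for the probability of “causes”.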
In data science, Bayes’ Theorem is used primarily in:
Bayesian Inference
Machine Learning
Text Classification
Medical Diagnosis
Predictive Modeling
When working with uncertain or sparse data, Bayes’ Theorem is very helpful because it lets data scientists update their beliefs as new evidence arrives and reach better-founded conclusions.
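The medical diagnosis case can be sketched in a few lines. The prevalence, sensitivity, and false-positive rate below are hypothetical figures chosen only for illustration, and `bayes_posterior` is a helper name introduced here.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(condition | positive test) using Bayes' Theorem."""
    # P(positive) by the law of total probability.
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical figures, not taken from any real test.
prior = 0.01                 # P(disease)
sensitivity = 0.95           # P(positive | disease)
false_positive_rate = 0.05   # P(positive | no disease)

p_disease_given_positive = bayes_posterior(prior, sensitivity, false_positive_rate)
print(round(p_disease_given_positive, 3))
```

Under these assumed numbers the posterior is only about 0.16, because the condition is rare and false positives outnumber true positives.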
Q5)Define variance and conditional variance.
Variance is a statistical measure of the spread or dispersion of the data points in a dataset. It is the average of the squared deviations of the individual data points from the dataset’s mean (average), so it quantifies the variability or “scatter” of the data.
Conditional Variance
Conditional variance measures the dispersion of a random variable given that another random variable takes a particular value, or that a particular event has occurred. Written Var(Y | X = x), it is the variance of Y computed only over the cases where X = x, so it depends on the known value of X rather than on the variance of X.
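To make the distinction concrete, the sketch below computes the overall (population) variance of some values and then the variance within each group. The data and the `conditional_variance` helper are hypothetical.

```python
from statistics import pvariance

# Hypothetical (group, value) observations; the group plays the role of X.
data = [("A", 2.0), ("A", 4.0), ("A", 6.0),
        ("B", 10.0), ("B", 10.0), ("B", 13.0)]

# Variance of the values, ignoring the group.
overall_variance = pvariance([value for _, value in data])

def conditional_variance(data, group):
    """Population variance of the values restricted to one group: Var(Y | X = group)."""
    return pvariance([value for g, value in data if g == group])

var_given_a = conditional_variance(data, "A")
var_given_b = conditional_variance(data, "B")
print(overall_variance, var_given_a, var_given_b)
```

Both conditional variances are much smaller than the overall variance here, because most of the spread comes from the difference between the two groups.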
Q6)Explain the concepts of mean, median, mode, and standard deviation.
Mean: The mean, often referred to as the average, is calculated by summing up all the values in a dataset and then dividing by the total number of values.
Median: When data are sorted in either ascending or descending order, the median is the value in the middle of the dataset. The median is the average of the two middle values when the number of data points is even.
In comparison to the mean, the median is less impacted by extreme numbers, making it a more reliable indicator of central tendency.
Mode: The value that appears most frequently in a dataset is the mode. One mode (unimodal), several modes (multimodal), or no mode (if all values occur with the same frequency) can all exist in …
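All four summary statistics are available in Python's standard `statistics` module. The sketch below uses a made-up list of values and the population standard deviation (`pstdev`); `stdev` would be the sample version.

```python
import statistics

values = [2, 3, 3, 5, 7, 10]   # hypothetical data

mean = statistics.mean(values)         # sum of values divided by their count
median = statistics.median(values)     # average of the two middle values (even count)
modes = statistics.multimode(values)   # list of the most frequent value(s)
std_dev = statistics.pstdev(values)    # square root of the population variance

print(mean, median, modes, std_dev)
```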
Q7)What is SQL, and what does it stand for?
SQL stands for Structured Query Language. It is a domain-specific language used for managing and manipulating relational databases. It is designed for database management, data retrieval, data manipulation, and data definition.
Q8)What is the ER model in SQL?
The Entity-Relationship (ER) model is a conceptual framework used in database design to represent the data entities in a database, their attributes, and the relationships between them. Although it is not part of the SQL language itself, the ER model is frequently used to plan the structure of a relational database before it is implemented in SQL.
Q9)What is data transformation?
The process of transforming data from one structure, format, or representation into another is referred to as data transformation. In order to make the data more suited for a given goal, such as analysis, visualisation, reporting, or storage, this procedure may involve a variety of actions and changes to the data. Data integration, cleansing, and analysis depend heavily on data transformation, which is a common stage in data preparation and processing pipelines.
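As a minimal sketch of such a step, the function below converts string fields to typed values and adds a min-max scaled column. The record layout, field names, and the `transform` helper are hypothetical.

```python
from datetime import date

# Hypothetical raw records, e.g. as read from a CSV file (everything is a string).
raw_rows = [
    {"id": "1", "signup": "2023-01-15", "spend": "120.50"},
    {"id": "2", "signup": "2023-02-01", "spend": "80.00"},
    {"id": "3", "signup": "2023-03-20", "spend": "200.00"},
]

def transform(rows):
    """Cast fields to proper types and add a min-max scaled spend in [0, 1]."""
    typed = [
        {
            "id": int(r["id"]),
            "signup": date.fromisoformat(r["signup"]),
            "spend": float(r["spend"]),
        }
        for r in rows
    ]
    lo = min(r["spend"] for r in typed)
    hi = max(r["spend"] for r in typed)
    span = hi - lo
    for r in typed:
        # Guard against division by zero when all values are equal.
        r["spend_scaled"] = (r["spend"] - lo) / span if span else 0.0
    return typed

clean_rows = transform(raw_rows)
print(clean_rows[0])
```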
Q11)What are the main components of a SQL query?
A SQL (Structured Query Language) query retrieves, modifies, or manages the data in a relational database. A query is built from a number of clauses, each of which serves a different function.
SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT
JOIN
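The clauses listed above can be seen together in one query. The sketch below runs it against an in-memory SQLite database through Python's `sqlite3` module; the `customers` and `orders` tables and their contents are hypothetical, and `LIMIT` is SQLite syntax that some other dialects spell differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES
        (1, 1, 50.0), (2, 1, 70.0), (3, 2, 20.0),
        (4, 2, 15.0), (5, 3, 200.0), (6, 3, 5.0);
""")

query = """
    SELECT c.name, SUM(o.amount) AS total        -- columns to return
    FROM orders AS o                             -- source table
    JOIN customers AS c ON c.id = o.customer_id  -- combine with a second table
    WHERE o.amount >= 10                         -- filter rows before grouping
    GROUP BY c.name                              -- one output row per customer
    HAVING SUM(o.amount) > 50                    -- filter groups after aggregation
    ORDER BY total DESC                          -- sort the result
    LIMIT 2                                      -- cap the number of rows
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Note that `WHERE` removes individual orders before they are grouped, while `HAVING` removes whole customers after their totals are computed.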
Q12)What is a primary key?
A primary key is a column, or a combination of columns, whose value uniquely identifies each record in a relational database table. Primary key values must be unique, and every row must have one: a primary key column cannot contain NULL.
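The uniqueness rule can be demonstrated with an in-memory SQLite database. The `students` table is hypothetical, and only uniqueness is shown because SQLite treats NULLs in some primary-key declarations more leniently than the SQL standard does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute("INSERT INTO students VALUES (1, 'Ada')")

try:
    # Re-using an existing primary key value violates the uniqueness constraint.
    conn.execute("INSERT INTO students VALUES (1, 'Ben')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]
print(duplicate_rejected, row_count)
```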
Q13)How do you handle missing or NULL values in a database table?
Missing or NULL values can arise for various reasons, such as incomplete data entry, optional fields, or data extraction processes. Common ways to handle them include:
Replace NULL with Placeholder Values
Handle NULL Values in Queries
Use Default Values
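The three approaches above can be sketched with SQLite: `COALESCE` substitutes a placeholder, `IS NULL` filters on missing values in a query, and a column `DEFAULT` supplies a value at insert time. The `employees` table is a hypothetical example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        department TEXT DEFAULT 'Unassigned',  -- used when no value is supplied
        bonus REAL                             -- optional, may be NULL
    );
    INSERT INTO employees (id, name, department, bonus)
        VALUES (1, 'Ada', 'Research', 500.0);
    INSERT INTO employees (id, name, bonus) VALUES (2, 'Ben', NULL);
""")

# Replace NULL with a placeholder value at query time.
with_placeholder = conn.execute(
    "SELECT name, COALESCE(bonus, 0.0) FROM employees ORDER BY id"
).fetchall()

# Handle NULL explicitly in a query: NULL must be tested with IS NULL, not '='.
missing_bonus = conn.execute(
    "SELECT name FROM employees WHERE bonus IS NULL"
).fetchall()

# The default value filled in the omitted department for the second row.
departments = conn.execute(
    "SELECT name, department FROM employees ORDER BY id"
).fetchall()

print(with_placeholder, missing_bonus, departments)
```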
Q14)Describe the Bernoulli distribution.
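The Bernoulli distribution is a discrete probability distribution for a random variable with exactly two possible outcomes, usually coded 1 (success) and 0 (failure). It has a single parameter p, the probability of success, so P(X = 1) = p and P(X = 0) = 1 - p; its mean is p and its variance is p(1 - p). A single coin toss is the standard example. Like other discrete random variables, such as the number of heads obtained when tossing three coins or the number of pupils in a class, it takes a finite or countable number of potential values.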
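A Bernoulli variable takes the value 1 with probability p and 0 otherwise. The sketch below writes out that probability mass function and checks by simulation that the sample mean lands near p; the helper names, the value p = 0.3, and the fixed seed are illustrative assumptions.

```python
import random

def bernoulli_pmf(k, p):
    """P(X = k) for a Bernoulli(p) variable; k must be 0 or 1."""
    if k not in (0, 1):
        raise ValueError("a Bernoulli variable only takes the values 0 and 1")
    return p if k == 1 else 1.0 - p

def bernoulli_sample(p, rng):
    """Draw one Bernoulli(p) outcome: 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0

p = 0.3
rng = random.Random(0)   # fixed seed so the run is repeatable
samples = [bernoulli_sample(p, rng) for _ in range(10_000)]
sample_mean = sum(samples) / len(samples)   # should be close to p
print(bernoulli_pmf(1, p), bernoulli_pmf(0, p), sample_mean)
```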
Q15)Explain the exponential distribution and where it’s commonly used.
The exponential distribution is the probability distribution of the time between events in a Poisson point process, in which events occur continuously and independently at a constant average rate. It is a special case of the gamma distribution (with shape parameter 1) and is the continuous analogue of the geometric distribution.
Common applications of the exponential distribution include:
Reliability Engineering
Queueing Theory
Telecommunications
Finance
Natural Phenomena
Survival Analysis
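For a rate parameter λ, the density is λ·exp(-λx) for x ≥ 0 and the mean waiting time is 1/λ. The sketch below defines the density and cumulative distribution function and simulates waiting times with `random.expovariate`; the rate of 2 events per unit time and the seed are arbitrary choices.

```python
import math
import random

def exponential_pdf(x, rate):
    """Density of the exponential distribution with the given rate (lambda)."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def exponential_cdf(x, rate):
    """P(T <= x): probability that the wait is at most x."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0

rate = 2.0                 # assumed: on average 2 events per unit of time
rng = random.Random(42)    # fixed seed so the run is repeatable
waits = [rng.expovariate(rate) for _ in range(20_000)]
mean_wait = sum(waits) / len(waits)   # should be close to 1 / rate

median_wait = math.log(2) / rate      # the CDF equals 0.5 at this point
print(mean_wait, exponential_cdf(median_wait, rate))
```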
Q16)What is the significance level (alpha) in hypothesis testing?
The significance level, usually denoted α (alpha), is the threshold used in hypothesis testing to decide whether the result of a statistical test is statistically significant. It is the maximum probability of committing a Type I error, that is, of rejecting a null hypothesis that is actually true, that the analyst is willing to accept. It is chosen before the test is run, and the null hypothesis is rejected when the p-value falls below it; 0.05 and 0.01 are common choices.
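The role of alpha as a decision threshold can be illustrated with a two-sided z-test. This is a sketch that assumes a known population standard deviation; the sample figures and the `two_sided_z_test` helper are hypothetical.

```python
import math
from statistics import NormalDist

def two_sided_z_test(sample_mean, null_mean, sigma, n, alpha=0.05):
    """Return (z, p_value, reject) for a two-sided z-test with known sigma."""
    z = (sample_mean - null_mean) / (sigma / math.sqrt(n))
    # Probability of a result at least this extreme if the null hypothesis is true.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Hypothetical data: 100 observations with mean 103, tested against a null mean of 100.
z, p_value, reject_at_05 = two_sided_z_test(103.0, 100.0, sigma=15.0, n=100)
_, _, reject_at_01 = two_sided_z_test(103.0, 100.0, sigma=15.0, n=100, alpha=0.01)
print(z, p_value, reject_at_05, reject_at_01)
```

The same data lead to rejection at alpha = 0.05 but not at alpha = 0.01, which shows that the conclusion depends on the threshold chosen in advance.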
Q17)Define different types of SQL functions.
SQL functions can be categorized into several types based on their functionality.
Scalar Functions
Aggregate Functions
Window Functions
Table-Valued Functions
System Functions
User-Defined Functions
Conversion Functions
Conditional Functions
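Several of these categories can be shown with SQLite through Python's `sqlite3` module. The sketch below covers scalar, aggregate, conversion, conditional, and user-defined functions only; the `sales` table and the `add_tax` function are hypothetical, and other database systems offer different built-in functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('north', 10.4), ('north', 20.0), ('south', 5.2);
""")

def add_tax(amount):
    """Hypothetical user-defined function: add 20% tax."""
    return round(amount * 1.2, 2)

# Register the Python function so it can be called from SQL.
conn.create_function("add_tax", 1, add_tax)

# Scalar functions: one output value per input row.
scalar = conn.execute(
    "SELECT UPPER(region), ROUND(amount) FROM sales ORDER BY amount"
).fetchall()

# Aggregate functions: one output value per group of rows.
aggregate = conn.execute(
    "SELECT region, SUM(amount), COUNT(*) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Conversion: change the data type of a value.
conversion = conn.execute(
    "SELECT CAST(amount AS INTEGER) FROM sales ORDER BY amount"
).fetchall()

# Conditional logic, written here as a CASE expression.
conditional = conn.execute(
    "SELECT CASE WHEN amount >= 10 THEN 'large' ELSE 'small' END "
    "FROM sales ORDER BY amount"
).fetchall()

# User-defined function registered above.
user_defined = conn.execute(
    "SELECT add_tax(amount) FROM sales ORDER BY amount"
).fetchall()

print(scalar, aggregate, conversion, conditional, user_defined)
```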