Data Science is one of the most well-known and widely used technologies in the world today. Professionals in this field are being hired by major corporations, and Data Scientists are among the highest-paid IT professionals due to strong demand and limited supply. Here is a rundown of the top and most frequently asked Data Science interview questions:
1. What is Data Science, and how does it work?
Data Science is a branch of computer science that focuses on transforming raw data into information and extracting useful insights from it. The insights Data Science allows us to draw from existing data have led to important advances in a variety of products and companies, which is why it is so popular. We can use these insights to determine a customer's tastes, the likelihood of a product succeeding in a given market, and so on.
2. What distinguishes Data Science from typical application programming?
- Traditional application development takes a fundamentally different approach from Data Science to building systems that provide value.
- In classic programming paradigms, we would examine the input, figure out the intended output, and write code containing the rules and statements needed to transform the given input into the expected output. As you can imagine, these rules were hard to write, especially for data such as photographs and videos, for which explicit rules are difficult to specify.
- Data Science shifts this approach. Instead of writing rules, we need enormous amounts of data to work with, including the inputs and their mappings to the expected outputs.
- We then employ Data Science algorithms, which use mathematical analysis to build rules that map the provided inputs to outputs. This rule-generation process is referred to as training. After training, the system's accuracy is tested and checked on data that was set aside before the training phase. The generated rules are a black box, so we cannot see exactly how the inputs are turned into outputs; however, if the accuracy is good enough, we can use the system (also called a model).
3. What is dimensionality reduction, and how does it work?
The process of reducing a dataset with a large number of dimensions (fields) to one with fewer dimensions is known as dimensionality reduction. This is accomplished by dropping some of the dataset's fields or columns, but not haphazardly: dimensions are discarded only after ensuring that the remaining fields still convey essentially the same information.
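As a rough illustration, here is a minimal sketch of dimensionality reduction using PCA from scikit-learn (a library choice of ours, not prescribed by the article) on a purely hypothetical random dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 100 samples with 10 dimensions (fields)
X = np.random.rand(100, 10)

# Reduce to 3 dimensions while keeping as much variance as possible
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (100, 3)
print(pca.explained_variance_ratio_.sum())  # share of information retained
```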
4. What exactly do you mean when you say linear regression?
Linear regression is a supervised learning technique that helps us understand the linear relationship between two variables: the predictor, or independent variable, and the response, or dependent variable. In linear regression, we try to understand how the dependent variable changes with respect to the independent variable. When there is only one independent variable it is called simple linear regression, and when there are several it is called multiple linear regression.
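A minimal sketch of simple linear regression, assuming scikit-learn and a tiny made-up dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one predictor (independent variable) and one response (dependent variable)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.2, 5.9, 8.1, 10.2])

model = LinearRegression().fit(X, y)      # simple linear regression (one predictor)
print(model.coef_, model.intercept_)      # learned slope and intercept
print(model.predict([[6]]))               # prediction for a new input
```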
5. What is a confusion matrix, and how does it work?
A confusion matrix is a table used to evaluate a model's performance. For a binary classifier it is a 2×2 matrix that tabulates the actual values against the predicted values.
- True Positive: records in which both the actual and the predicted value are true.
- False Negative: records in which the actual value is true but the predicted value is false.
- False Positive: records in which the actual value is false but the predicted value is true.
- True Negative: records in which both the actual and the predicted value are false.
A good model is one in which most records fall into the true positive and true negative cells, with as few false positives and false negatives as possible. This is how the confusion matrix works.
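A short sketch of how such a matrix might be computed with scikit-learn (the library and the labels are illustrative assumptions, not from the article):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for a binary classifier
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```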
6. In Data Science, what is bias?
Bias is a form of error that occurs in a Data Science model when the chosen algorithm is not powerful enough to capture the underlying patterns or trends in the data. In other words, this error arises when the data is too complex for the algorithm, so it builds a model based on overly simple assumptions. The result is underfitting and lower accuracy. Simple algorithms such as linear regression and logistic regression tend to produce high bias.
7. Why is Python used in DS for data cleaning?
Data scientists must clean and transform massive data sets so they can work with them. It’s critical to remove meaningless outliers, malformed records, missing values, inconsistent formatting, and other redundant data for improved results.
Matplotlib, Pandas, NumPy, Keras, and SciPy are some of the most popular Python packages for data cleaning and analysis. These libraries are used to load and clean data so that effective analysis can be carried out. For instance, a CSV file named "Student" might contain information about an institute's students, such as their names, standard, address, phone number, grades, and marks; a file like this usually needs cleaning before it can be analysed.
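A rough sketch of how Pandas might be used to clean such a "Student" file (the column names and values below are assumptions for illustration, built inline so the snippet runs without the actual CSV):

```python
import pandas as pd

# In practice: df = pd.read_csv("Student.csv"); a small stand-in frame is built here instead
df = pd.DataFrame({
    "name": [" alice ", "BOB", "BOB", None],
    "standard": [10, 12, 12, 11],
    "marks": [88.0, None, None, 75.0],
})

df = df.drop_duplicates()                             # remove duplicate records
df["name"] = df["name"].str.strip().str.title()       # fix whitespace and inconsistent casing
df["marks"] = df["marks"].fillna(df["marks"].mean())  # fill missing marks with the column mean
df = df.dropna(subset=["name"])                       # drop rows with no name at all

print(df)
```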
8. What is the significance of R in data visualisation?
With over 12,000 packages available from open-source repositories, R has one of the best ecosystems for data analysis and visualisation. It also has a large community, which means you can quickly find solutions to your problems on sites such as Stack Overflow.
It also improves data management and facilitates distributed computing by dividing work across multiple jobs and nodes, which reduces the complexity and execution time of processing huge datasets.
9. What are the most widely used libraries in Data Science?
The following are some of the most widely used libraries for data extraction, cleaning, visualisation, and deployment of DS models:
- TensorFlow: supports parallel computing, with library management backed by Google.
- SciPy: mostly used for solving differential equations, multidimensional programming, data manipulation, and displaying graphs and charts.
- Pandas: a Python library used to implement ETL (Extract, Transform, Load) capabilities in business applications.
- Matplotlib: a free and open-source plotting library that can be used in place of MATLAB, with good performance and low memory consumption.
10. How do you perform data analysis in Excel?
To perform data analysis in Excel, you can use various features and functions such as PivotTables, charts, and formulas. Here is a general process for performing data analysis in Excel:
- Prepare your data: Make sure your data is organized and cleaned up, and remove any unnecessary data or duplicates.
- Create a PivotTable: Select the data you want to analyze, and create a PivotTable. This allows you to summarize and analyze your data using various filters and criteria.
- Use charts: Use different types of charts to visualize your data, such as line charts, bar charts, and pie charts. This can help you identify trends and patterns in your data.
- Use formulas: Use Excel formulas such as SUM, AVERAGE, COUNT, MAX, and MIN to perform calculations and analyze your data. You can also use functions such as IF, VLOOKUP, and HLOOKUP to perform more complex analyses.
- Use data analysis tools: Excel has a variety of built-in data analysis tools such as regression analysis, data mining, and forecasting. These tools can help you perform more advanced analyses and predict future trends.
- Clean up your analysis: Once you have completed your analysis, clean up your data and charts to make sure they are easy to understand and visually appealing.
11. What is the role of Power BI in data science, and how do you use it to visualize data?
Power BI is a powerful business analytics tool from Microsoft that allows you to visualize and analyze data in a user-friendly way. Its role in data science is to help you create interactive visualizations and reports to gain insights from your data. Here are the steps to use Power BI to visualize data:
- Connect to your data: Power BI allows you to connect to various data sources, such as Excel spreadsheets, SQL Server databases, and cloud-based services like Azure.
- Create a dashboard: Once you have connected to your data, you can create a dashboard, which is a collection of visualizations and reports that give you an overview of your data.
- Choose your visualizations: Power BI provides a wide range of visualization options, such as bar charts, line charts, scatter plots, and maps. You can choose the best visualization for your data and customize it as needed.
- Add filters: Filters allow you to focus on specific parts of your data. Power BI provides various filter options, such as slicers, which allow you to filter data by a specific value or range.
- Create relationships: Power BI allows you to create relationships between different data sources, which enables you to combine data from multiple sources and analyze it together.
- Publish and share your dashboard: Once you have created your dashboard, you can publish it to the cloud and share it with others. You can also embed the dashboard in a website or application.
12. In a decision tree algorithm, what is entropy?
Entropy is a measure of impurity or randomness in a decision tree algorithm. The entropy of a dataset indicates how pure or impure the dataset's values are; in simple terms, it tells us about the variability in the dataset.
Let's say we're handed a box containing ten blue marbles. The entropy of the box is then 0, because it contains marbles of only one colour and no impurity exists; if we pick a marble out of the box, the probability of it being blue is 1.0. If four of the blue marbles are replaced with four red marbles, the probability of drawing a blue marble drops to 0.6 and the entropy of the box rises to about 0.97.
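A quick sketch of this entropy calculation in plain Python, using the standard formula entropy = -sum(p * log2(p)):

```python
import math

def entropy(probabilities):
    """Shannon entropy: -sum(p * log2(p)) over the class probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1.0]))        # 10 blue marbles only -> 0.0 (pure)
print(entropy([0.6, 0.4]))   # 6 blue, 4 red -> about 0.971 (impure)
```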
13. What is information gain in a decision tree algorithm?
At each step of building a decision tree, we create a node that decides which feature to use to split the data, i.e., which feature will best separate our data so that we can make predictions. This decision is made using information gain, a measure of how much entropy is reduced when a particular feature is used to split the data. The feature chosen to split the data is the one that provides the highest information gain.
14. What is k-fold cross-validation, and how does it work?
In k-fold cross-validation, we divide the dataset into k equal parts and then iterate over the dataset k times. In each iteration, one of the k parts is used for testing and the remaining k - 1 parts are used for training. In this way, each of the k parts of the dataset is used for both training and testing.
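A minimal sketch of 5-fold cross-validation, assuming scikit-learn and its built-in Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold is used once for testing, the rest for training
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```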
15. Describe the operation of a recommender system.
Many consumer-facing, content-driven web sites use a recommender system to provide recommendations for consumers from a library of available information. These systems make recommendations based on what they know about the users’ preferences from their platform activity.
Consider a movie-streaming service comparable to Netflix or Amazon Prime. If a user has previously watched and enjoyed films in the action and horror genres, it is safe to assume that the user likes these genres, and it would therefore be wise to recommend similar films to this user.
16. What is the definition of a normal distribution?
Data distribution describes how data is spread out or dispersed. Data can be distributed in a number of ways: it can be skewed to the left or right, for example, or spread symmetrically around a central value such as the mean or median. When the data forms a bell-shaped curve with no skew to the left or right, and the mean is equal to the median, the distribution is called a normal distribution.
17. What is Deep Learning, and how does it work?
Deep Learning is a type of Machine Learning in which neural networks are used to mimic the structure of the human brain, so that machines learn from the data presented to them in much the same way a brain does. The neural networks used in Deep Learning are made up of multiple hidden layers (hence the name "deep" learning) connected to each other, with the output of each layer feeding into the next.
18. What exactly is a recurrent neural network (RNN)?
A recurrent neural network, or RNN for short, is a Machine Learning technique based on artificial neural networks. RNNs are used to find patterns in sequences of data, such as time series, stock market data, temperature readings, and so on. As in a feedforward network, data is passed from one layer to the next and each node performs mathematical operations on it; in addition, an RNN maintains contextual information about earlier computations, so these operations are temporal. The network is called recurrent because the same operations are repeated at every step of the sequence, with the result of previous steps carried forward.
19. Explain the concept of selection bias.
The bias that happens during data sampling is known as selection bias. When a sample is not representative of the population being studied in a statistical study, this type of bias occurs.
20. What is a ROC curve, and how does it work?
ROC stands for Receiver Operating Characteristic. A ROC curve is a plot of the true positive rate against the false positive rate for various probability thresholds on the predicted values, and it helps us find the best tradeoff between the two. The closer the curve is to the upper-left corner, the better the model; in other words, the curve with the largest area under it corresponds to the best model.
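A minimal sketch of computing a ROC curve and its AUC, assuming scikit-learn and a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]       # predicted probabilities for the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs) # points of the ROC curve
print("AUC:", roc_auc_score(y_test, probs))     # area under the curve
```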
21. What do you mean when you say “decision tree”?
A decision tree is a supervised learning method that can be used for both classification and regression, so the dependent variable can be either a numerical or a categorical value.
Each internal node represents a test on an attribute, each edge represents an outcome of that test, and each leaf node represents a class label. The tree therefore encodes a sequence of test conditions, each of which leads toward a final decision.
22. What exactly do you mean when you say “random forest model”?
A random forest combines numerous models to produce the final result; more precisely, it combines multiple decision trees and aggregates their outputs. Decision trees are therefore the building blocks of a random forest.
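A minimal random-forest sketch, assuming scikit-learn and its built-in Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 decision trees, each trained on a random sample of rows and features
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(forest.score(X_test, y_test))
```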
23. Can you explain how Microsoft Cognitive Services can be used in data science?
Microsoft Cognitive Services is a set of pre-built APIs, SDKs, and services that developers can use to add intelligent features to their applications. These intelligent features are powered by machine learning and artificial intelligence algorithms, and they can be used in various scenarios, including data science. Here are some ways Microsoft Cognitive Services can be used in data science:
- Text Analytics: Microsoft Cognitive Services provides APIs that allow you to extract key phrases, entities, and sentiment from text data. This can be useful in data science applications that involve analyzing large amounts of text data, such as social media posts or customer feedback.
- Computer Vision: Microsoft Cognitive Services provides APIs that allow you to analyze and interpret visual data, such as images and videos. This can be useful in data science applications that involve image or video analysis, such as object detection, facial recognition, or image classification.
- Speech Services: Microsoft Cognitive Services provides APIs that allow you to transcribe speech to text, translate speech in real-time, and even synthesize speech from text. This can be useful in data science applications that involve analyzing or generating spoken language data, such as call center analytics or speech-to-text transcription.
- Anomaly Detection: Microsoft Cognitive Services provides APIs that allow you to detect anomalies in time series data. This can be useful in data science applications that involve analyzing sensor data or financial data.
24. What is the difference between data modelling and database design?
Data modelling can be thought of as the first step toward database design. It is the process of constructing a conceptual model based on the relationships between different data entities, moving from the conceptual stage to the logical model and then to the physical schema. It entails the systematic application of data modelling techniques.
The process of designing a database is known as database design. The database design generates a complete data model of the database as an output. Database design, strictly speaking, refers to the detailed logical model of a database, but it can also refer to physical design decisions and storage characteristics.
25. How do you train and deploy a machine learning model in Azure Machine Learning Studio?
Azure Machine Learning Studio is a cloud-based platform that provides a drag-and-drop interface for building, training, and deploying machine learning models. Here are the steps to train and deploy a machine learning model in Azure Machine Learning Studio:
- Prepare data: Before you can train a model, you need to prepare the data. Azure Machine Learning Studio provides various tools for data preparation such as data cleaning, feature engineering, and data normalization.
- Create a new experiment: Once the data is prepared, you can create a new experiment in Azure Machine Learning Studio. The experiment is a workspace where you can build, train, and evaluate machine learning models.
- Add data: In the experiment, you can add the prepared data to the workspace.
- Choose a machine learning algorithm: In Azure Machine Learning Studio, you can choose from a variety of machine learning algorithms such as regression, classification, clustering, and recommendation systems. You can drag and drop the algorithm to the experiment workspace.
- Train the model: Once the algorithm is added to the workspace, you can configure the parameters and hyperparameters of the algorithm and then run the experiment to train the model.
- Evaluate the model: After the model is trained, you can evaluate its performance using various metrics such as accuracy, precision, recall, and F1 score.
- Deploy the model: Once the model is trained and evaluated, you can deploy it to Azure as a web service or as a batch execution service. You can use the REST API to integrate the model into your applications or use the Azure portal to manage the deployment.
26. How do you manage and store data in Azure, and what tools do you use for this?
Azure provides a range of services for managing and storing data in the cloud. Here are some of the tools and services that can be used for this purpose:
- Azure SQL Database: This is a fully managed relational database service that provides high availability, automatic backups, and scalability. It supports various SQL Server features and can be used to store and manage structured data.
- Azure Cosmos DB: This is a globally distributed, multi-model database service that supports NoSQL databases such as document, key-value, graph, and column-family. It offers automatic scalability and high availability and can be used to store and manage unstructured data.
- Azure Blob Storage: This is a fully managed object storage service that can be used to store and manage unstructured data such as images, videos, and documents. It provides high availability, durability, and scalability, and can be accessed using REST APIs.
- Azure Data Lake Storage: This is a scalable and secure data lake service that can be used to store and manage large amounts of unstructured and structured data. It provides granular access controls and can be accessed using various tools such as Azure Data Factory and Azure Databricks.
- Azure Backup: This is a backup and disaster recovery service that can be used to protect and recover data in Azure. It provides automatic backups and can be used to backup data from on-premises environments as well as Azure services such as Azure VMs and Azure File Shares.
- Azure Site Recovery: This is a disaster recovery service that can be used to replicate and recover applications and workloads to Azure or another location. It provides near-zero RPO and RTO and can be used to replicate workloads from on-premises environments as well as Azure services such as Azure VMs.
27. What exactly is the F1 score, and how do you compute it?
The F1 score is the harmonic mean of precision and recall, and it is used to assess a test's correctness: F1 = 2 × (precision × recall) / (precision + recall). It ranges from 0 to 1. F1 = 1 means precision and recall are both perfect, while values close to 0 indicate that precision, recall, or both are poor.
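A small sketch of the calculation, assuming scikit-learn and made-up labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall
print(2 * precision * recall / (precision + recall))
print(f1_score(y_true, y_pred))   # same value computed directly
```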
28. Can you explain the difference between Azure SQL Database and Azure SQL Data Warehouse, and how they are used for data analysis?
Azure SQL Database and Azure SQL Data Warehouse are both cloud-based database services offered by Microsoft Azure, but they serve different purposes in data analysis.
Azure SQL Database is a fully managed relational database service that is designed for transactional workloads and supports online transaction processing (OLTP) scenarios. It is a scalable and highly available database service that can be used for data storage, data retrieval, and data processing. Azure SQL Database provides features such as automatic patching, backup and recovery, and built-in security features.
Azure SQL Data Warehouse, on the other hand, is a fully managed, elastic, and petabyte-scale data warehouse service that is designed for analytics workloads and supports online analytical processing (OLAP) scenarios. It allows users to store and analyze large amounts of data using distributed computing and columnar storage. Azure SQL Data Warehouse provides features such as parallel data processing, automatic scaling, and integration with other Azure services such as Power BI and Azure Machine Learning.
29. What is the distinction between an error and a residual error?
An error is the difference between the observed value and the true value, whereas a residual error is the difference between the observed value and the value predicted by the model. Because the true values are never known, we use residual errors to assess an algorithm's performance; residuals let us estimate the error from the observed values and thereby obtain a reasonable estimate of the model's inaccuracy.
30. What is the relationship between Data Science and Machine Learning?
Data Science and Machine Learning are closely related terms that are sometimes misinterpreted. They both work with data. However, there are a few key distinctions that help us understand how they differ.
Data Science is a vast field that works with large amounts of data and helps us to extract meaning from it. The overall Data Science process takes care of a number of phases that are involved in extracting insights from the given data. Data collection, data analysis, data modification, data visualisation, and other critical phases are all part of this process.
Machine Learning, on the other hand, can be regarded as a sub-field of Data Science. It likewise deals with data, but here we are only interested in learning how to turn the processed data into a functional model that maps inputs to outputs, for example a model that takes an image as input and tells us whether it contains a flower.
In a nutshell, Data Science is concerned with acquiring data, processing it, and then extracting conclusions from it. Machine Learning is a branch of Data Science that deals with creating models using algorithms. As a result, Machine Learning is an essential component of Data Science.
31. Describe the differences between univariate, bivariate, and multivariate analysis.
- Univariate analysis entails assessing data using only one variable, or, in other words, a single column or vector of the data. This analysis enables us to comprehend the information and discover patterns and trends. Example: Analyzing the weight of a group of people.
- Bivariate analysis entails examining data using only two variables, or in other words, the data can be organised into a two-column table. We can figure out the relationship between the variables using this type of analysis. Analyzing data containing temperature and altitude, for example.
- Multivariate analysis is a type of data analysis that involves more than two variables; the data can have any number of columns beyond two. With this type of analysis we can find out the impact of all the other variables (the input variables) on a single variable (the output variable). Example: analysing property prices using information about each home, such as its location, crime rate, area, and number of floors.
32. What are our options for dealing with missing data?
- To handle missing data, we must first determine the percentage of data that is missing in a certain column so that we may adopt an acceptable technique for dealing with the circumstance. If the majority of the data in a column is missing, for example, removing the column is the best option unless we have some way of making reasonable estimates about the missing values. If the amount of missing data is little, however, we have numerous options for filling it in.
- Filling them all with a default value or the value with the highest frequency in that column, such as 0 or 1, is one option. This could be handy if these values appear in the bulk of the data in that column.
- Another option is to use the mean of all the values in the column to fill in the missing values in the column. This method is commonly used since missing values are more likely to be closer to the mean than to the mode.
- Finally, if we have a large dataset and only a few rows contain missing values in some columns, the simplest and fastest option is to drop those rows. Dropping a few rows should not be a problem because the dataset is so large. A minimal Pandas sketch of these options follows this list.
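Here is the Pandas sketch referred to above; the toy DataFrame and its column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({"age": [25, np.nan, 32, 41, np.nan],
                   "city": ["Pune", "Delhi", None, "Delhi", "Delhi"]})

print(df.isna().mean())                               # fraction of missing values per column

df["age"] = df["age"].fillna(df["age"].mean())        # fill a numeric column with the mean
df["city"] = df["city"].fillna(df["city"].mode()[0])  # fill a categorical column with the most frequent value
# df = df.dropna()                                    # or simply drop the few incomplete rows
print(df)
```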
33. What are the advantages of reducing dimensionality?
The dimensions and size of the entire dataset are reduced by dimensionality reduction. It removes extraneous elements while keeping the data’s general information intact. Data processing is faster when the dimensions are reduced. The fact that data with a lot of dimensions takes a long time to process and train a model on is one of the reasons why it’s so difficult to deal with. Reducing the number of dimensions speeds up the process, reduces noise, and improves model accuracy.
34. In Data Science, what is the bias–variance trade-off?
Our goal when using Data Science or Machine Learning is to create a model with low bias and low variance. Bias and variance are the errors that emerge when a model is either too simple or overly complex. So, while developing a model, we can only achieve high accuracy if we understand the tradeoff between bias and variance.
When a model is too simple to capture the patterns in a dataset, it is said to have high bias. To reduce bias we need to make the model more complex. Increasing the complexity of the model can reduce bias, but if we make it too complex it becomes overly flexible and starts fitting the noise in the training data, resulting in high variance. The tradeoff, therefore, is that increasing complexity reduces bias but increases variance, while decreasing complexity increases bias but reduces variance. Our goal is to strike a balance: a model complex enough to produce low bias, but not so complex that it produces high variance.
35. What do you mean by the term RMSE?
RMSE stands for root mean square error. It is a measure of accuracy for regression: it tells us the magnitude of the error made by a regression model. RMSE is calculated as follows:
First, we calculate the errors of the regression model's predictions, i.e., the differences between the actual and predicted values. These errors are then squared, the mean of the squared errors is computed, and the square root of that mean is taken: RMSE = sqrt(mean((actual - predicted)^2)). A model with a lower RMSE makes smaller errors and is therefore considered more accurate.
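A small NumPy sketch of this calculation on made-up values:

```python
import numpy as np

y_actual = np.array([3.0, 5.0, 2.5, 7.0])
y_predicted = np.array([2.8, 5.4, 2.9, 6.5])

errors = y_actual - y_predicted          # differences between actual and predicted values
rmse = np.sqrt(np.mean(errors ** 2))     # square, average, then take the square root
print(rmse)
```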
36. In SVM, what is a kernel function?
A kernel function is a special mathematical function used in the SVM algorithm. In simple terms, a kernel function takes data as input and transforms it into the required form. This transformation relies on what is known as the kernel trick. Using a kernel function, we can transform data that is not linearly separable (cannot be split by a straight line) into a representation in which it is linearly separable.
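A minimal sketch contrasting a linear kernel with an RBF kernel, assuming scikit-learn and a synthetic dataset that is not linearly separable:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# A dataset that cannot be split by a straight line
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)     # radial basis function kernel

print("linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy:", rbf_svm.score(X, y))
```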
37. How can we select an appropriate value of k in k-means?
Choosing the right value of k is a crucial part of k-means clustering. The elbow method can be used to determine it. We run the k-means algorithm for a range of values, say k = 1 to 15, and for each value of k we compute a score known as the inertia, or within-cluster sum of squares.
The inertia is computed as the sum of squared distances between each point and the centroid of its cluster. As k increases from a low value, the inertia drops sharply at first; after a certain value of k, the decline becomes fairly minor. That point, the "elbow" of the curve, is chosen as the appropriate value of k.
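A minimal sketch of the elbow method, assuming scikit-learn and synthetic blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Inertia (within-cluster sum of squared distances) for k = 1..10
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)
# The k after which inertia stops dropping sharply (the "elbow") is chosen.
```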
38. How can we deal with outliers?
Outliers can be handled in a variety of ways. One option is to drop them, but outliers should only be removed when their values are clearly erroneous or extreme. For example, if a dataset of baby weights contains a value of 98.6 degrees Fahrenheit, it is erroneous; if it contains a value of 187 kg, it is an extreme figure that is unsuitable for our model. Other options include:
- Trying a different kind of model. If a linear model is being distorted by outliers, for example, we may switch to a non-linear model.
- Normalising the data, which brings the extreme values closer to the other data points.
- Using methods such as random forest, which are less affected by outliers.
39. What is ensemble learning, and how does it work?
Our goal when utilising Data Science and Machine Learning to construct models is to get a model that can grasp the underlying trends in the training data and make accurate predictions or classifications. However, some datasets are extremely complex, and it can be challenging for a single model to comprehend the underlying trends in these datasets. In some cases, we merge multiple unique models to boost performance. This is referred to as ensemble learning.
40. Explain how recommender systems use collaborative filtering.
Collaborative filtering is one of the approaches used to build recommender systems. In this technique, we generate suggestions for a user based on the likes and dislikes of other users who are similar to them; this similarity is calculated using a variety of factors such as age, gender, location, and so on. If User A is similar to User B and User A watched and enjoyed a film, that film will be recommended to User B, and vice versa. In other words, the content of the movie itself is not taken into account.
41. Explain content-based filtering in recommender systems.
Content-based filtering is another strategy used to build recommender systems. In this technique, suggestions are generated from the features of the content the user is interested in. For example, if a user watches action and mystery movies and rates them highly, it is evident that the user likes these kinds of films, so there is a greater chance the user will enjoy recommendations of movies in similar genres. In other words, the content of the movie is taken into account when making suggestions.
42. In Data Science, explain the concept of bagging.
Bagging is a type of ensemble learning; the name is short for bootstrap aggregation. In this technique, we use bootstrapping to generate multiple samples of size N from an existing dataset. The bootstrapped samples are then used to train many models in parallel, which makes the resulting bagging model more robust than a single simple model. When we need to make a prediction, all of the trained models make their own predictions; we average the results in the case of regression and take the most frequent prediction (a majority vote) in the case of classification.
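A minimal bagging sketch, assuming scikit-learn (whose BaggingClassifier bootstraps decision trees by default) and a built-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 base models, each trained on a bootstrap sample; predictions are aggregated by majority vote
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print(bagging.score(X_test, y_test))
```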
43. Explain the concept of boosting in Data Science.
One of the ensemble learning strategies is boosting. Unlike bagging, it is not a technique for training our models in parallel. In boosting, we generate many models and train them sequentially by combining weak models iteratively in such a way that the training of a new model is dependent on the training of previous models. When training the new model, we use the patterns learned by the prior model and test them on a dataset. We assign extra weight to observations in the dataset that are wrongly handled or predicted by prior models in each iteration. Boosting can also be used to reduce model bias.
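A minimal boosting sketch, assuming scikit-learn's GradientBoostingClassifier (one of several boosting implementations) and a built-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are added sequentially; each new tree focuses on the errors made by the previous ones
boosting = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                      random_state=0).fit(X_train, y_train)
print(boosting.score(X_test, y_test))
```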
44. Explain the concept of stacking in Data Science.
Stacking, like bagging and boosting, is an ensemble learning technique. In bagging and boosting we could only combine weak models that use the same learning algorithm, such as logistic regression; such models are called homogeneous learners. Stacking, on the other hand, allows us to combine weak models that use different learning algorithms; these are called heterogeneous learners. Stacking works by training many diverse weak models or learners and then combining them by training a meta-model that makes the final predictions based on the outputs of those weak models.
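A minimal stacking sketch, assuming scikit-learn, with a decision tree and an SVM as heterogeneous base learners and logistic regression as the meta-model:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Heterogeneous base learners combined by a logistic-regression meta-model
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC())],
    final_estimator=LogisticRegression(max_iter=1000),
).fit(X_train, y_train)
print(stack.score(X_test, y_test))
```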
45. Distinguish between Machine Learning and Deep Learning.
- Machine Learning is a subset of Data Science that deals with using current data to assist systems in learning new skills to accomplish different tasks without the need for explicit rules to be coded.
- Deep Learning, on the other hand, is a branch of Machine Learning that creates models using algorithms that attempt to mimic how the human brain learns from data in order to gain new abilities. Deep Learning uses densely connected neural networks with many layers.
46. Mention the different kernel functions that can be used in SVM.
In SVM, there are four types of kernel functions:
- Linear kernel
- Polynomial kernel
- Radial basis kernel
- Sigmoid kernel
47. How can you tell if the data in a time series is stationary?
A time series is said to be stationary when its mean and variance remain constant over time. If these statistical properties do not change over the period being analysed, we can conclude that the series is stationary for that period.
48. What does it mean to conduct a root cause analysis?
Root cause analysis is the process of determining the underlying causes of particular problems or failures. A factor is considered a root cause if, once it is removed, the sequence of events that previously resulted in a fault, error, or undesirable outcome now functions correctly. Root cause analysis was originally developed for investigating industrial accidents, but it is now used in a wide range of situations.
49. What is A/B testing, and how does it work?
For randomised studies with two variables, A/B testing is a type of statistical hypothesis testing. These variables are denoted by the letters A and B. When we want to test a new feature in a product, we employ A/B testing. We give users two variants of the product in the A/B test, and we designate these variants A and B. The A variant might be the product with the new feature, whereas the B variant could be the one without it. We collect customer ratings for these two goods after they have used them. If product version A receives a statistically significant higher rating, the new feature is deemed an improvement and helpful, and it is accepted.
50. Which do you think is superior, collaborative filtering or content-based filtering, and why?
For generating recommendations, content-based filtering is thought to be superior to collaborative filtering. This isn’t to say that collaborative filtering produces poor recommendations. However, because collaborative filtering is reliant on the preferences of other users, we can’t put a lot of faith in it. In addition, users’ preferences may shift in the future. For example, a user may enjoy a film now that he or she did not enjoy ten years ago. Furthermore, members who share some characteristics may not have the same preferences for the type of content available on the site.
51. Why does the name Naive Bayes include the term “naive”?
The Naive Bayes algorithm is a classification technique used in Data Science. It has "Bayes" in its name because it is based on the Bayes theorem, which deals with the probability of an event occurring given that another event has already occurred.
It has "naive" in its name because it assumes that every variable in the dataset is independent of the others. For real-world data, this assumption is rarely true. Even so, the algorithm is very useful for tackling a variety of difficult problems, such as spam email classification.
52. What is reinforcement learning, and how does it work?
Reinforcement learning is a subset of Machine Learning concerned with building software agents that take actions in an environment in order to maximise cumulative reward. Here, a reward is used to tell the model (during training) whether a particular action achieves the objective or moves the agent closer to it. For example, if we create an ML model that plays a video game, the reward would be the points collected or the level reached in the game. Reinforcement learning is used to build agents that can make real-world decisions that help the model achieve a clearly defined goal.
53. Explain the concept of TF-IDF vectorization.
TF-IDF stands for Term Frequency–Inverse Document Frequency. It is a numerical measure of how important a word is to a document within a corpus of documents. TF-IDF is frequently used in text mining and information retrieval.
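A minimal sketch, assuming scikit-learn's TfidfVectorizer and a made-up three-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["data science extracts insights from data",
          "machine learning is a part of data science",
          "deep learning uses neural networks"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)     # one row per document, one column per term

print(vectorizer.get_feature_names_out())    # the learned vocabulary
print(tfidf.toarray().round(2))              # TF-IDF weight of each term in each document
```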
54. What assumptions are necessary for linear regression?
Linear regression necessitates a number of assumptions. The following are the details:
- The data used to train the model, which is a sample selected from a population, should be representative of the population.
- The relationship between the mean of the dependent variable and the independent variables is linear.
- For any value of the independent variable X, the variance of the residuals is the same (homoscedasticity).
- Each observation exists independently of the others.
- For any fixed value of the independent variable, the dependent variable (equivalently, the error term) is normally distributed.
55. What happens if some of the assumptions that linear regression relies on are broken?
These assumptions may be mildly violated (minor violations) or severely violated (i.e., most of the data violates them). The two cases have different impacts on a linear regression model.
If the assumptions are severely violated, the results can be rendered completely useless; if they are only mildly violated, the results will simply carry more bias or variance.
56. Define syntax error.
In the context of data cleaning, syntax errors are values that have been recorded incorrectly, for example typos, stray white space, or inconsistent letter casing. Checking a column's distinct values with .unique() or plotting a bar chart of value counts are quick ways to spot such mistakes.
57. What do you mean by the term standardization or normalization?
Depending on the dataset you are working with and the machine learning approach you use, standardising or normalising your data can help ensure that the differing scales of distinct variables do not negatively affect your model's performance.
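A small sketch of both options, assuming scikit-learn and two made-up variables on very different scales:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two variables on very different scales (e.g. age in years, income in rupees)
X = np.array([[25, 30000], [32, 120000], [47, 55000]], dtype=float)

print(StandardScaler().fit_transform(X))   # standardisation: zero mean, unit variance
print(MinMaxScaler().fit_transform(X))     # normalisation: rescale to the [0, 1] range
```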
58. How do you handle null values?
Null values can be handled in a variety of ways: deleting the rows that contain them, replacing them with the mean, median, or mode, replacing them with a new category (e.g. "unknown"), predicting the missing values with a model, or using machine learning models that can handle null values natively.
59. What do you mean by a histogram?
Histograms are bar charts that depict the frequency of the values of a numerical variable and are used to approximate the probability distribution of that variable. It enables you to immediately grasp the distribution’s structure, variance, and possible outliers.
60. How do you select metrics?
There is no such thing as a universal measure. The metric(s) used to assess a machine learning model are determined by a number of factors:
- Is it a classification or regression task?
- What is the company’s goal? Consider the contrast between precision and recall.
- What is the target variable’s distribution?
Adjusted R-squared, MAE, MSE, accuracy, recall, precision, F1 score, and so on are just a few of the metrics that can be employed.
61. What is a ‘false positive’?
A false positive is when a condition is mistakenly identified as present when it is not.
62. Define the term ‘false negative’.
A false negative occurs when a condition is mistakenly identified as absent when it is actually present.
63. What do you mean by the term overfitting?
Overfitting is a modelling error in which the model "fits" the training data too well, resulting in high variance and low bias. As a result, even though an overfit model has high accuracy on the training data, it will generalise poorly to new data points.
64. What is a convex function?
A convex function is one in which a line segment drawn between any two points on the graph lies on or above the graph. A convex function has no local minima other than its global minimum.
65. Define a non-convex function.
A non-convex function is one in which a line segment drawn between two points on the graph may fall below the graph. Its shape is often described as "wavy", and it can have multiple local minima.
66. What criteria do you use to determine the statistical significance of a finding?
To evaluate statistical significance, you would use hypothesis testing. The null hypothesis and alternate hypothesis would be stated first. Second, you’d calculate the p-value, which is the likelihood of getting the test’s observed results if the null hypothesis is true. Finally, you would set the level of significance (alpha) and reject the null hypothesis if the p-value is smaller than the alpha — in other words, the result is statistically significant.
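A minimal sketch of this procedure, assuming SciPy and two simulated samples (all numbers are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=100, scale=10, size=50)   # e.g. ratings for variant A
group_b = rng.normal(loc=104, scale=10, size=50)   # e.g. ratings for variant B

# Null hypothesis: the two groups have the same mean
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
print(p_value, "significant" if p_value < alpha else "not significant")
```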
67. How does experimental data contrast with observational data?
- Observational data is derived from observational studies, which are conducted when specific variables are observed and a link is sought.
- Experimental data is derived from experiments in which certain variables are manipulated while others are controlled and held constant, in order to establish whether there is causality.
68. What is statistical power?
Statistical power refers to the power of a binary hypothesis test: the probability that the test rejects the null hypothesis when the alternative hypothesis is true.
69. Give examples of data that follows neither a Gaussian nor a log-normal distribution.
- Any type of categorical data will not have a Gaussian or log-normal distribution.
- Exponentially distributed data, e.g. the amount of time a car battery lasts or the time until an earthquake occurs.
70. Give an example where the median is a better measure than the mean.
When there are outliers that positively or negatively skew the data. For example, with household income a few very large values pull the mean upward, while the median remains representative of a typical household.
71. What is the Law of Large Numbers?
The Law of Large Numbers states that as the number of trials increases, the average of the results approaches the expected value.
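A quick simulation sketch with NumPy, using a fair die whose expected value is 3.5:

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)   # simulate fair die rolls (expected value = 3.5)

for n in (10, 100, 10_000, 100_000):
    print(n, rolls[:n].mean())             # the running average approaches 3.5 as n grows
```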
Expert Corner
These are a few of the important questions for a Data Scientist interview, covering both entry-level and expert-level topics. If you are just starting out in Data Science, these questions can help you assess and improve your current level of skill. You can also refer to various online tutorials to learn more about Data Science. We hope this set of Data Science interview questions and answers aids you in your interview preparation. Best wishes!