In today’s data-driven world, the demand for skilled Data Analysts is at an all-time high — and India is no exception. With companies across industries relying on data to guide strategic decisions, the role of a data analyst has become not just desirable but essential.
Whether you’re a fresher just stepping out of college, a career switcher from another domain, or an experienced professional aiming for a better opportunity, cracking the data analyst interview can be your key to an exciting, high-growth career.
But with so many tools, concepts, and questions to prepare for — from Python, SQL, and Excel to EDA, machine learning basics, and data visualisation tools like Tableau — it can feel overwhelming.
That’s why we’ve curated this comprehensive list of 90+ data analyst interview questions, complete with detailed answers, tips, and real-world context tailored for Indian aspirants. Whether you’re applying for roles in top IT firms, startups, fintech companies, or MNCs, this guide will give you the confidence and clarity you need to excel in your interviews.
Let’s dive into the Data Analyst Interview Questions that matter and help you land your dream data analyst job in India!
- What is Data Analytics?
- Data Analyst Interview Questions for Freshers
- Data Analyst Interview Questions for Experienced
- General Data Analyst Interview Questions
- Excel-Specific Questions
- SQL-Specific Questions
- Tableau-Specific Questions
- Python & General Data Handling
- NumPy
- Pandas
- Matplotlib / Seaborn (Data Visualisation)
- Scikit-learn (Machine Learning & Preprocessing)
- Statsmodels
- Other Relevant Python Libraries
- ETL & Data Engineering Tools
- Performance Optimisation & Debugging
What is Data Analytics?

Data Analytics is the process of examining raw data to uncover meaningful insights, patterns, and trends that help in decision-making. It involves a combination of statistical techniques, programming tools, and domain knowledge to transform data into actionable intelligence.
In simple terms, data analytics answers questions like:
- What happened? (Descriptive Analytics)
- Why did it happen? (Diagnostic Analytics)
- What will happen? (Predictive Analytics)
- What should we do next? (Prescriptive Analytics)
Key Steps in Data Analytics:
- Data Collection – Gathering data from sources like databases, APIs, surveys, or sensors.
- Data Cleaning – Removing errors, duplicates, and inconsistencies.
- Data Exploration – Using visualisations and statistics to understand patterns.
- Data Modelling – Applying algorithms or statistical models to identify relationships or make predictions.
- Data Interpretation – Drawing conclusions and creating reports or dashboards.
Why is Data Analytics Important?
- Helps organisations make informed business decisions
- Improves efficiency by identifying bottlenecks
- Enables personalisation in marketing and user experiences
- Supports risk management and fraud detection
- Drives innovation through data-driven insights
In the Indian job market, data analytics is a highly sought-after skill, especially in sectors like IT, banking, e-commerce, healthcare, and telecom. Learning data analytics opens doors to roles such as Data Analyst, Business Analyst, Data Scientist, and Data Engineer. So, let’s start learning and preparing for the Data Analyst interview.
Data Analyst Interview Questions for Freshers
1. What do you mean by collisions in a hash table? Explain the ways to avoid them.
A collision in a hash table occurs when two different keys produce the same hash index. Since each index should ideally hold one unique key-value pair, collisions lead to ambiguity about which item to store or retrieve.
Ways to handle or avoid collisions:
- Chaining: Store multiple elements at the same index using a linked list.
- Open Addressing: Find the next available slot using methods like linear probing or quadratic probing.
- Double Hashing: Use a second hash function to decide where to place the value.
- Good Hash Functions: Use hash functions that minimise collisions by evenly distributing keys.
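To make chaining concrete, here is a minimal sketch of a hash table that resolves collisions with a list per bucket (illustrative only; the class name ChainedHashTable is hypothetical, and Python's built-in dict already handles collisions internally):

```python
class ChainedHashTable:
    """Toy hash table that resolves collisions by chaining (one list per bucket)."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                    # key already present: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))         # new key (or collision): append to the chain

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable(size=4)            # a small table forces collisions
table.put("apple", 10)
table.put("banana", 20)
print(table.get("banana"))                  # 20
```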
2. What are the ways to detect outliers? Explain different ways to deal with them.
Outliers are data points that deviate significantly from the rest of the dataset.
Detection methods:
- Box Plot (IQR method): Values outside 1.5 × IQR are considered outliers.
- Z-score: Points with an absolute Z-score greater than 3 are potential outliers.
- Scatter plots: Visual anomalies.
- Isolation Forest or DBSCAN: Machine learning-based detection.
Dealing with outliers:
- Remove if they are errors or irrelevant.
- Cap or floor extreme values (winsorisation).
- Impute using mean/median/nearest neighbours.
- Transform the data (e.g., log transformation) to reduce skewness.
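As a quick illustration, the IQR rule and winsorisation can be applied to a small, made-up pandas Series (the values below are hypothetical):

```python
import pandas as pd

s = pd.Series([12, 14, 15, 13, 14, 120, 13, 15, 14, -40])  # hypothetical data with two outliers

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]          # detection: values outside the IQR fences
winsorised = s.clip(lower=lower, upper=upper)    # treatment: cap/floor the extreme values
print(outliers.tolist())
```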
3. What is the data analysis process?
The data analysis process typically involves the following steps:
- Define the objective or business problem.
- Data collection from various sources (databases, APIs, files).
- Data cleaning to remove inconsistencies and missing values.
- Exploratory Data Analysis (EDA) to identify patterns and relationships.
- Modelling or statistical analysis using appropriate tools.
- Visualisation and reporting to communicate insights.
- Decision-making based on findings.
4. What are the different challenges one faces during data analysis?
Common challenges include:
- Missing or incomplete data
- Duplicate records
- Inconsistent data formats
- Outliers and noise
- Data integration from multiple sources
- Scalability issues with large datasets
- Biased or unrepresentative data
- Lack of domain knowledge
- Poor documentation
5. Explain data cleansing.
Data cleansing (or data cleaning) is the process of identifying and correcting inaccurate, incomplete, or inconsistent data to improve data quality.
Tasks involved:
- Removing duplicates
- Handling missing values
- Correcting data types or formatting issues
- Resolving inconsistencies (e.g., ‘UK’ vs ‘United Kingdom’)
- Standardising entries
Effective data cleaning ensures the reliability of the analysis and model accuracy.
6. Which validation methods are employed by data analysts?
Validation ensures that the data model or analysis is accurate and performs well. Common validation techniques include:
- Cross-validation (K-Fold): Splits data into k subsets to train and validate multiple times.
- Train/Test Split: Reserve part of the dataset for testing.
- Bootstrapping: Repeated random sampling with replacement.
- Hold-out validation: Split into training, validation, and test sets.
- Stratified Sampling: Ensures representative proportions of categories in each subset (especially for imbalanced data).
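For example, stratified K-Fold cross-validation takes only a few lines with scikit-learn (the iris dataset and logistic regression below are just placeholders for any model and data):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean())   # average validation accuracy across the 5 folds
```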
7. What are the responsibilities of a Data Analyst?
Typical responsibilities include:
- Collecting and organising data from various sources.
- Cleaning and transforming data for analysis.
- Performing statistical analysis and identifying patterns.
- Creating dashboards and visual reports to present findings.
- Communicating insights to stakeholders.
- Collaborating with business teams to inform decision-making.
- Automating reports and repetitive tasks using tools like SQL or Python.
- Monitoring KPIs and business metrics.
8. Explain the KNN imputation method.
K-Nearest Neighbours (KNN) imputation is a technique to fill in missing values based on similarity to other records.
- For each missing value, the algorithm identifies the ‘k’ most similar rows (neighbours) using a distance metric (e.g., Euclidean).
- The missing value is then imputed with the average (for numeric) or mode (for categorical) value of the neighbours.
KNN imputation is more accurate than mean/median imputation but computationally expensive on large datasets.
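A small sketch with scikit-learn's KNNImputer on made-up numeric data:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical feature matrix with missing entries marked as np.nan
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [np.nan, 8.0]])

imputer = KNNImputer(n_neighbors=2)   # each gap is filled from the 2 most similar rows
print(imputer.fit_transform(X))
```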
9. What do you mean by data visualisation?
Data visualisation is the graphical representation of information and data using visual elements like charts, graphs, and maps.
It helps in:
- Identifying trends, correlations, and outliers.
- Communicating complex information in an intuitive format.
- Simplifying large and complex datasets.
- Supporting decision-making.
10. How does data visualisation help you?
Data visualisation helps by:
- Revealing patterns and relationships not easily seen in raw data.
- Speeding up insights and supporting quicker decisions.
- Improving understanding across technical and non-technical stakeholders.
- Highlighting anomalies and data quality issues.
- Making reports more interactive and engaging (e.g., dashboards).
Tools commonly used: Tableau, Power BI, Matplotlib, Seaborn, Plotly.
11. Mention some of the Python libraries used in data analysis.
Popular Python libraries for data analysis include:
- NumPy – Numerical operations and array handling.
- Pandas – Data manipulation and analysis.
- Matplotlib / Seaborn / Plotly – Visualisation libraries.
- SciPy – Scientific computing and statistical analysis.
- Scikit-learn – Machine learning and modelling.
- Statsmodels – Statistical testing and modelling.
- OpenPyXL / xlrd – Working with Excel files.
- SQLAlchemy – Database interaction.
12. Explain a hash table.
A hash table is a data structure that stores key-value pairs. It uses a hash function to compute an index into an array of buckets from which the desired value can be found.
- Offers constant time complexity for search, insert, and delete (on average).
- Collisions are handled using chaining or open addressing.
- Widely used in dictionaries, caches, and indexing databases.
Data Analyst Interview Questions for Experienced
1. What are the characteristics of a good data model?
A good data model should exhibit the following characteristics:
- Accuracy: Reflects the real-world process it represents.
- Simplicity: Easily understandable, with minimal complexity.
- Scalability: Can handle data growth without significant redesign.
- Flexibility: Can be adapted to future changes in business needs.
- Integrity: Maintains data consistency and correctness.
- Performance: Optimised for fast queries and updates.
- Documentation: Clearly documented for easy handover and maintenance.
2. Explain Collaborative Filtering.
Collaborative filtering is a recommendation technique used in systems like Netflix or Amazon. It is based on the assumption that users who agreed in the past will likely agree again in the future.
Types:
- User-based filtering: Recommends items based on similar users.
- Item-based filtering: Recommends items similar to those the user has liked.
It works by finding patterns in user-item interactions without requiring explicit information about the items themselves.
3. What do you mean by clustering algorithms? Write different properties of clustering algorithms.
Clustering algorithms group a set of objects in such a way that objects in the same group (or cluster) are more similar to each other than to those in other groups.
Properties:
- Scalability: Can handle large datasets efficiently.
- Shape and size sensitivity: Some algorithms (like DBSCAN) detect clusters of arbitrary shape.
- Noise handling: Some methods are robust to outliers.
- Number of clusters: Some require a predefined cluster count (e.g., K-means), while others do not (e.g., DBSCAN).
- Interpretability: The results should be meaningful and explainable.
Common clustering algorithms: K-means, Hierarchical, DBSCAN, Mean Shift.
4. What is a Pivot Table? Write its usage.
A Pivot Table is a data summarisation tool commonly used in Excel and BI tools to automatically sort, count, and total data stored in a table.
Usage:
- Summarise large datasets with ease.
- Perform dynamic aggregation (SUM, COUNT, AVERAGE, etc.).
- Slice and dice data using row and column fields.
- Create cross-tabulations and custom reports without formulas.
5. Explain Hierarchical Clustering.
Hierarchical clustering builds a hierarchy of clusters either:
- Agglomerative (Bottom-up): Starts with individual data points and merges them into clusters.
- Divisive (Top-down): Starts with all data points in one cluster and splits them.
Dendrograms are used to visualise the hierarchy. This method does not require the number of clusters in advance but is computationally expensive for large datasets.
6. Name some popular tools used in Big Data.
- Apache Hadoop: Distributed storage and processing
- Apache Spark: In-memory processing engine for large-scale data
- Kafka: Real-time data streaming
- Hive: SQL-like interface for Hadoop
- Pig: Data flow language for ETL
- Flink: Real-time stream processing
- HBase, Cassandra: NoSQL databases for massive-scale data
- Airflow: Workflow automation tool
7. What do you mean by Logistic Regression?
Logistic Regression is a classification algorithm used to model the probability of a binary outcome (e.g., yes/no, 1/0, true/false).
- It uses the logistic (sigmoid) function to constrain output between 0 and 1.
- It estimates the relationship between independent variables and a categorical dependent variable.
- Common in spam detection, churn prediction, and medical diagnosis.
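A brief, hedged sketch with scikit-learn on its built-in binary dataset (the scaling step is added only so the solver converges cleanly; it is not part of logistic regression itself):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)                  # binary target: malignant vs benign
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))   # sigmoid outputs: probabilities between 0 and 1
print(clf.score(X_test, y_test))       # classification accuracy on unseen data
```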
8. What do you mean by the K-means algorithm?
K-means is an unsupervised clustering algorithm that partitions the dataset into K distinct clusters based on distance to the cluster centres (centroids).
Steps:
- Select K initial centroids.
- Assign each point to the nearest centroid.
- Recalculate the centroids.
- Repeat until convergence.
It is fast and works well with large datasets but is sensitive to the initial choice of centroids and assumes spherical clusters.
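A short illustration with scikit-learn on synthetic data (make_blobs simply generates well-separated groups for the demo):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # synthetic 2-D points

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)   # final centroids after convergence
print(km.labels_[:10])       # cluster assignment of the first 10 points
```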
9. Write the difference between Variance and Covariance.
Variance measures the spread of a single variable around its mean. Covariance measures the relationship between two variables:
- Positive covariance: both variables move in the same direction.
- Negative covariance: one increases as the other decreases.
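A quick NumPy illustration on made-up numbers:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 9.0])

print(np.var(x, ddof=1))   # sample variance of a single variable
print(np.cov(x, y))        # 2x2 covariance matrix; the off-diagonal entries are Cov(x, y)
```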

10. What are the advantages of using Version Control?
- Track changes in code or data analysis scripts.
- Collaboration: Multiple users can work on the same project without conflict.
- Rollback capability: Easily revert to a previous version.
- Documentation: Maintains a log of changes over time.
- Branching: Experiment without affecting the main codebase.
Tools: Git, GitHub, Bitbucket, GitLab.
11. Explain N-gram.
An N-gram is a sequence of ‘n’ items (typically words or characters) from a given text. Used primarily in natural language processing (NLP).
- Unigram: One word
- Bigram: Two consecutive words
- Trigram: Three consecutive words
Applications include:
- Text prediction
- Sentiment analysis
- Spelling correction
- Machine translation
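A tiny word-level n-gram helper (the function name ngrams is just for illustration):

```python
def ngrams(text, n):
    """Return the list of word-level n-grams for a piece of text."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("data analysts love clean data", 2))
# [('data', 'analysts'), ('analysts', 'love'), ('love', 'clean'), ('clean', 'data')]
```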
12. Mention some of the statistical techniques used by Data Analysts.
- Descriptive statistics: Mean, median, mode, variance, standard deviation
- Inferential statistics: Hypothesis testing, confidence intervals, p-values
- Regression analysis: Linear and logistic regression
- Time series analysis: ARIMA, exponential smoothing
- ANOVA and t-tests
- Chi-square tests
- Correlation and covariance analysis
- Bayesian statistics (for probabilistic reasoning)
General Data Analyst Interview Questions
1. Define the term ‘Data Wrangling’ in Data Analytics.
Data wrangling (or data munging) is the process of cleaning, structuring, and enriching raw data into a desired format for better decision-making. It includes handling missing values, removing duplicates, correcting data types, and transforming features to prepare data for analysis or modelling.
2. What are the various steps involved in any analytics project?
- Problem definition
- Data collection
- Data cleaning/wrangling
- Exploratory Data Analysis (EDA)
- Feature engineering
- Model building (if applicable)
- Evaluation and validation
- Presentation and reporting
- Deployment and monitoring (if ML is used)
3. Which technical tools have you used for analysis and presentation?
- Python: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn
- SQL: PostgreSQL, MySQL
- Excel: Pivot tables, formulas, charts
- Tableau/Power BI: Dashboards and visualisation
- Jupyter Notebooks: For storytelling with code
- ETL tools: Airflow, Alteryx, Talend (if applicable)
4. What are the best methods for data cleaning?
- Handling missing values (imputation or deletion)
- Removing or correcting duplicates
- Filtering outliers
- Standardising data formats and data types
- Validating ranges and domains
- Normalising or scaling features
- Replacing inconsistent categories (e.g. “M”, “Male”, “male” → “Male”)
5. What is the significance of Exploratory Data Analysis (EDA)?
EDA helps in understanding the data’s structure, spotting patterns, detecting anomalies, testing hypotheses, and preparing data for modelling. Visual and statistical summaries can uncover hidden relationships and guide the choice of algorithms.
6. Explain descriptive, predictive, and prescriptive analytics.
- Descriptive: Summarises historical data (e.g., “What happened?”)
- Predictive: Forecasts future trends using models (e.g., “What could happen?”)
- Prescriptive: Suggests actions using optimisation or decision logic (e.g., “What should we do?”)
7. What are the different types of sampling techniques used by data analysts?
- Simple random sampling
- Stratified sampling
- Cluster sampling
- Systematic sampling
- Convenience sampling
Stratified and cluster sampling are used when the population has identifiable subgroups.
8. What are the ethical considerations of data analysis?
- Data privacy and confidentiality
- Bias and fairness in algorithms
- Transparency in methodologies
- Consent for data collection
- Avoiding manipulation of results or misleading visualisations
9. How can you handle missing values in a dataset?
- Drop rows/columns with too many missing values
- Impute using mean, median, or mode
- Predict missing values using regression or KNN
- Flag missing values with indicators
- Domain-specific methods may also be applied.
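A small pandas sketch on hypothetical data showing the main options side by side:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 32, 41],
                   "city": ["Pune", "Delhi", None, "Mumbai"]})   # hypothetical data

df["age_missing"] = df["age"].isna()                  # flag missing values with an indicator
df["age"] = df["age"].fillna(df["age"].median())      # impute a numeric column with the median
df["city"] = df["city"].fillna(df["city"].mode()[0])  # impute a categorical column with the mode
# Alternatively, df.dropna(thresh=2) keeps only rows with at least two non-missing values
print(df)
```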
10. How is Overfitting different from Underfitting?
- Overfitting: Model learns noise along with the pattern, performs well on training but poorly on unseen data
- Underfitting: Model fails to capture patterns in data, performing poorly even on training data
Regularisation, cross-validation, and appropriate model complexity help mitigate these issues.
11. What are the different types of Hypothesis Testing?
- Z-test: Large samples and known population variance
- T-test: Small samples, unknown population variance
- Chi-square test: Categorical data
- ANOVA: Comparing means of 3 or more groups
- Mann-Whitney U, Kruskal-Wallis: Non-parametric alternatives
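For instance, a two-sample t-test can be run with SciPy on hypothetical samples (Welch's variant, which does not assume equal variances):

```python
import numpy as np
from scipy import stats

# Hypothetical page-load times (seconds) for two designs
a = np.array([12.1, 11.8, 12.5, 13.0, 12.2, 11.9])
b = np.array([11.2, 11.5, 10.9, 11.8, 11.1, 11.4])

t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(t_stat, p_value)   # a p-value below 0.05 would suggest a significant difference
```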
12. Explain Type I and Type II errors in Statistics.
- Type I error (False Positive): Rejecting a true null hypothesis
- Type II error (False Negative): Failing to reject a false null hypothesis
There is a trade-off between the two, managed via significance levels (alpha).
Excel-Specific Questions
1. In Microsoft Excel, a numeric value can be treated as text if it is preceded by what?
A numeric value is treated as text if it begins with an apostrophe ('). For example, entering '123 stores it as the text "123" rather than the number 123.
2. What is the difference between COUNT, COUNTA, COUNTBLANK, and COUNTIF?
- COUNT: Counts numeric cells
- COUNTA: Counts non-empty cells
- COUNTBLANK: Counts blank cells
- COUNTIF: Counts cells that meet a condition, e.g. =COUNTIF(A1:A10, ">50")
3. How do you make a dropdown list in MS Excel?
Use Data Validation: go to Data → Data Validation → Allow: List, then enter the values or reference a range.
4. Can you provide a dynamic range in “Data Source” for a Pivot Table?
Yes. Use an Excel Table (Insert → Table) or define a Named Range with the OFFSET and COUNTA functions so the Pivot Table's data source expands automatically.
5. What is the function to find the day of the week for a particular date value?
Use TEXT(date, "dddd") for the full name (e.g., Monday), or TEXT(date, "ddd") for the short form (e.g., Mon).
6. How does the AND() function work in Excel?
AND(condition1, condition2, ...) returns TRUE only if all conditions are TRUE. It is often used inside IF.
7. Explain how VLOOKUP works in Excel.
VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup]) searches for a value in the first column of a range and returns the corresponding value from a specified column.
8. What function would you use to get the current date and time in Excel?
- =NOW() returns the current date and time
- =TODAY() returns the current date only
SQL-Specific Questions
1. How do you subset or filter data in SQL?
Use the WHERE clause:
```sql
SELECT * FROM employees WHERE department = 'Sales';
```
2. What is the difference between a WHERE clause and a HAVING clause in SQL?
- WHERE: Filters rows before grouping
- HAVING: Filters groups after aggregation
```sql
SELECT department, COUNT(*) FROM employees GROUP BY department HAVING COUNT(*) > 5;
```
3. How are UNION, INTERSECT, and EXCEPT used in SQL?
- UNION: Combines rows from two queries (removes duplicates)
- INTERSECT: Returns common rows
- EXCEPT: Returns rows from the first query not in the second
4. What is a Subquery in SQL?
A query nested inside another query. It can be used in the SELECT, FROM, or WHERE clause.
Example:
```sql
SELECT name FROM employees WHERE id IN (SELECT id FROM attendance WHERE status = 'Absent');
```
5. How do you write a Stored Procedure in SQL?
A stored procedure is a named set of SQL statements saved in the database. Example (in MySQL):
```sql
DELIMITER //
CREATE PROCEDURE GetEmployees()
BEGIN
    SELECT * FROM employees;
END //
DELIMITER ;
```
6. Write an SQL stored procedure to find the total even numbers between two user-given numbers.
```sql
DELIMITER //
CREATE PROCEDURE CountEvenNumbers(IN num1 INT, IN num2 INT)
BEGIN
    DECLARE cnt INT DEFAULT 0;
    DECLARE i INT;
    SET i = num1;
    WHILE i <= num2 DO
        IF MOD(i, 2) = 0 THEN
            SET cnt = cnt + 1;
        END IF;
        SET i = i + 1;
    END WHILE;
    SELECT cnt AS EvenCount;
END //
DELIMITER ;
```
Tableau-Specific Questions
1. How is Joining different from Blending in Tableau?
- Joining: Performed within the same data source (SQL-style)
- Blending: Combines data from different sources using a common field (like a left join)
2. What do you understand by LOD Expressions in Tableau?
Level of Detail (LOD) Expressions allow you to control the granularity of aggregations.
Types:
- Fixed: Independent of the view, e.g. {FIXED Region: AVG(Sales)}
- Include/Exclude: Adjust aggregation levels relative to the current view
3. What are the different connection types in Tableau Software?
- Live Connection: Real-time data; slower for large datasets
- Extract: Data is imported into Tableau as a static snapshot for faster performance
4. What are the different joins that Tableau provides?
- Inner Join
- Left Join
- Right Join
- Full Outer Join
5. What is a Gantt Chart in Tableau?
A Gantt chart displays tasks along a timeline, useful for project scheduling or resource management. Each bar represents a task duration.
6. What is the difference between Treemaps and Heatmaps in Tableau?
- Treemaps: Hierarchical, space-filling visualisation of data using rectangles
- Heatmaps: Grid of coloured cells showing intensity or density of a variable
Python & General Data Handling
1. What are Python’s advantages for data analysis?
Python is favoured for data analysis due to its readability, extensive ecosystem of libraries (like Pandas, NumPy, Matplotlib, Scikit-learn), strong community support, and integration capabilities with web frameworks, databases, and cloud services. Its syntax allows analysts to write efficient, readable, and maintainable code, which is crucial for collaboration and scaling projects.
2. How is a list different from a tuple in Python?
- List: Mutable, can be modified after creation.
- Tuple: Immutable, once created, its contents cannot be changed.
In data analytics, tuples are preferred for storing constant data or when performance is critical, as they are faster due to immutability.
3. Explain the difference between deepcopy() and copy() in Python.
- copy.copy() creates a shallow copy: the object itself is copied, but nested elements still reference the originals.
- copy.deepcopy() creates a deep copy: all nested objects are copied recursively.
This distinction is important when working with complex data structures like nested lists or dictionaries to avoid unintended modifications.
4. What is the purpose of the zip() function in Python?
zip() combines multiple iterables (like lists or tuples) element-wise into a single iterator of tuples. It is useful for parallel iteration, pairing values for analysis, or creating dictionaries from two lists, as shown below.
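A small illustration with made-up lists:

```python
names = ["Asha", "Ravi", "Meera"]
scores = [88, 92, 79]

paired = list(zip(names, scores))   # [('Asha', 88), ('Ravi', 92), ('Meera', 79)]
lookup = dict(zip(names, scores))   # {'Asha': 88, 'Ravi': 92, 'Meera': 79}

for name, score in zip(names, scores):   # parallel iteration over both lists
    print(name, score)
```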
5. How do you handle large datasets in Python that don’t fit in memory?
- Use generators and iterators instead of loading everything at once.
- Employ Dask, Vaex, or PySpark for out-of-core computation.
- Work in chunks using pandas.read_csv() with the chunksize parameter (see the sketch after this list).
- Optimise memory usage by setting appropriate data types and dropping unnecessary columns.
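A minimal chunked-reading sketch, assuming a hypothetical sales.csv file with an amount column:

```python
import pandas as pd

total = 0
# Each chunk is a regular DataFrame, so aggregates can be built up incrementally
for chunk in pd.read_csv("sales.csv", chunksize=100_000):
    total += chunk["amount"].sum()
print(total)
```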
NumPy
1. What are the key features of NumPy?
- Powerful N-dimensional array object (ndarray)
- Broadcasting functions
- Vectorised operations for performance
- Integration with C/C++ code
- Useful linear algebra and Fourier transform capabilities
2. How do you perform matrix multiplication using NumPy?
Use np.dot(A, B) or A @ B, where A and B are NumPy arrays. Both perform matrix multiplication, but @ is more readable and preferred in modern Python code.
3. How can you generate random numbers in NumPy?
Use the numpy.random module. For example:
```python
np.random.rand(3, 2)  # generates a 3x2 array of random floats between 0 and 1
```
You can also use np.random.randint() for integers and np.random.normal() for normally distributed data.
4. What is the difference between np.array() and np.asarray()?
- np.array() always creates a new array.
- np.asarray() converts the input to an array only if it is not already one, which avoids unnecessary copying when you are unsure whether the input is already a NumPy array.
5. What are broadcasting rules in NumPy?
Broadcasting allows operations on arrays of different shapes. NumPy automatically expands the smaller array to match the shape of the larger array, under certain compatibility conditions, without creating redundant copies in memory.
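A brief illustration:

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)   # shape (2, 3)
col_means = matrix.mean(axis=0)       # shape (3,)

centred = matrix - col_means          # the (3,) array is broadcast across both rows
scaled = matrix * 10                  # a scalar is broadcast to every element
print(centred.shape)                  # (2, 3)
```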
Pandas
1. What is the difference between loc[] and iloc[] in Pandas?
- loc[]: label-based indexing (e.g. df.loc['row_label'])
- iloc[]: integer position-based indexing (e.g. df.iloc[0])
Use loc when accessing rows/columns by name, and iloc when working by index position.
2. How can you handle missing data using Pandas?
- df.isnull() or df.info() to detect missing values
- df.dropna() to remove rows/columns with missing values
- df.fillna(value) to replace missing values with a specified value or a method such as mean, median, or forward-fill (method='ffill')
3. How do you merge and join DataFrames in Pandas?
Use pd.merge() for SQL-style joins (left, right, inner, outer), or df.join() for simpler index-based joins. Example:
```python
pd.merge(df1, df2, on='id', how='inner')
```
4. Explain the use of groupby() in Pandas with an example.
groupby() splits data into groups based on a column and applies an aggregation function:
```python
df.groupby('department')['salary'].mean()
```
This returns the average salary by department.
5. How can you optimise memory usage of a Pandas DataFrame?
- Convert data types (e.g., int64 to int32)
- Use the category dtype for columns with repeating string values
- Drop unused columns
- Use chunked reading for large files
6. How do you perform a time series analysis in Pandas?
- Convert date columns using pd.to_datetime()
- Set the date as the index using df.set_index()
- Use resample(), rolling(), and shift() for aggregations, moving averages, and lags (see the sketch after this list)
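A minimal sketch on synthetic daily data (the column names below are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "sales": np.random.rand(90) * 100,
})

df["date"] = pd.to_datetime(df["date"])          # ensure a datetime dtype
df = df.set_index("date")                        # use the date as the index

monthly = df["sales"].resample("M").sum()        # monthly totals
rolling7 = df["sales"].rolling(window=7).mean()  # 7-day moving average
lag1 = df["sales"].shift(1)                      # previous day's value (lag feature)
```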
Matplotlib / Seaborn (Data Visualisation)
1. What is the difference between Matplotlib and Seaborn?
Matplotlib is a low-level plotting library. Seaborn is built on top of Matplotlib and provides a high-level interface for creating more attractive and informative statistical graphics.
2. How do you create a boxplot using Seaborn?
```python
import seaborn as sns
sns.boxplot(x='category', y='value', data=df)
```
3. How can you customise legends, axes, and titles in Matplotlib?
Use plt.title(), plt.xlabel(), plt.ylabel(), and plt.legend() to customise these elements. For advanced customisation, you can access the Axes and Figure objects directly.
4. What types of plots are most useful for visualising distributions?
- Histogram
- Boxplot
- KDE plot
- Violin plot (Seaborn)
These help understand the spread, skewness, and outliers in a dataset.
Scikit-learn (Machine Learning & Preprocessing)
1. How do you handle categorical variables in Scikit-learn?
Categorical variables must be converted into numerical format before model training. The two most common approaches are:
- Label Encoding (LabelEncoder): converts each category to a unique integer.
- One-Hot Encoding (OneHotEncoder or pd.get_dummies()): creates a binary column for each category.
For tree-based models, label encoding may suffice, but for linear models, one-hot encoding is preferred to avoid ordinal assumptions.
2. Explain the use of train_test_split() in Scikit-learn.
train_test_split() divides a dataset into training and testing sets so that model performance can be evaluated on unseen data.
Example:
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
3. What is the difference between StandardScaler and MinMaxScaler?
- StandardScaler: standardises features by removing the mean and scaling to unit variance (z-score scaling).
- MinMaxScaler: scales features to a specific range, usually [0, 1].
Use StandardScaler when the data follows a Gaussian distribution; use MinMaxScaler for algorithms sensitive to magnitude, like KNN or neural networks.
4. How do you perform feature selection in Scikit-learn?
- Filter methods: statistical tests such as SelectKBest with chi-square or ANOVA
- Wrapper methods: algorithms like recursive feature elimination (RFE)
- Embedded methods: Algorithms like Lasso or tree-based models that rank features based on importance
5. What are pipelines in Scikit-learn, and why are they useful?
Pipelines allow the sequential chaining of preprocessing and modelling steps. They help ensure that the same transformations applied during training are also applied during inference.
Example:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
```
Statsmodels
1. What is the use of the Statsmodels library in Python?
Statsmodels is used for statistical modelling. It provides classes and functions for linear models, GLMs, time series analysis, and hypothesis testing — often with detailed statistical output (p-values, R², AIC/BIC, confidence intervals).
2. How do you interpret the summary output of an OLS regression model using Statsmodels?
The summary() output includes:
- Coefficients: Estimates of feature impact
- p-values: Help test if coefficients are statistically significant
- R-squared / Adjusted R-squared: Model fit
- F-statistic: Tests the overall significance of the model
Low p-values (< 0.05) usually indicate significant predictors.
3. What are some advantages of using Statsmodels over Scikit-learn?
Statsmodels offers:
- More detailed statistical output
- In-built support for hypothesis testing
- Model diagnostics and confidence intervals
Scikit-learn, however, is better for predictive modelling and pipelines.
Other Relevant Python Libraries
1. What is the use of the Dask library in data analytics?
Dask provides parallel computing and out-of-core dataframes that can handle large datasets using familiar Pandas syntax. It’s useful for big data tasks that cannot fit into memory.
2. How does Vaex help in handling large datasets?
Vaex is optimised for lazy evaluation and memory mapping of large HDF5/Arrow datasets. It allows fast filtering, aggregations, and visualisation without loading the full data into RAM.
3. How do you use PySpark for big data analytics in Python?
PySpark is the Python API for Apache Spark. It supports distributed data processing, useful for analysing huge datasets across clusters. You can use Spark SQL, DataFrames, and MLlib for scalable analytics.
4. What is Plotly and when would you use it over Matplotlib?
Plotly is a graphing library that creates interactive visualisations. Use Plotly when you need web-based, zoomable, and shareable charts, dashboards, or real-time visualisations.
ETL & Data Engineering Tools
1. What is Airflow, and how is it used in a data pipeline?
Apache Airflow is an orchestration tool to programmatically schedule and monitor workflows. You define Directed Acyclic Graphs (DAGs) in Python to represent ETL jobs or batch processes.
2. How do you connect and extract data from SQL databases using Python?
Use libraries like SQLAlchemy, pyodbc, or psycopg2 for database connections, combined with Pandas:
```python
import pandas as pd
df = pd.read_sql('SELECT * FROM table_name', con=connection)
```
3. What is Openpyxl or XlsxWriter, and how do they differ from Pandas Excel methods?
These libraries allow fine-grained formatting and writing of Excel files. While Pandas' to_excel() is easy, Openpyxl and XlsxWriter provide advanced features like formatting, charts, and conditional styling.
4. How do you schedule and automate Python scripts for reporting tasks?
- Cron (Linux) or Task Scheduler (Windows)
- Python packages like schedule or APScheduler
- Airflow for DAG-based orchestration
- Cloud tools (e.g. AWS Lambda + EventBridge)
Performance Optimisation & Debugging
1. How do you identify and fix bottlenecks in your data analysis pipeline?
- Use profiling tools like cProfile and line_profiler
- Monitor memory with memory_profiler
- Optimise data types in Pandas
- Avoid loops; use vectorised operations
- Use chunking or Dask/Vaex for large datasets
2. What tools or techniques do you use to profile Python code?
- cProfile, line_profiler, and timeit
- Visual tools like SnakeViz for visualising execution time
- Memory profiling with memory_profiler or tracemalloc
3. How can vectorisation improve performance in NumPy and Pandas?
Vectorisation leverages low-level optimised C operations, eliminating slow Python loops. Operations are applied to entire arrays or DataFrames at once, which greatly reduces computation time.
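For example, squaring a million numbers:

```python
import numpy as np

values = np.random.rand(1_000_000)

squared_loop = [v ** 2 for v in values]   # slow: element-by-element Python loop
squared_vec = values ** 2                 # fast: one vectorised operation in optimised C code
```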
