Karthik Adari

10 Data Analyst Projects from GitHub That Can Make Your Resume Stand Out

Karthik Adari — Wed, 29 Apr 2026 22:01:04 GMT

Most data analyst resumes look the same because they only mention tools.

But strong resumes show business problems, KPIs, dashboards, SQL analysis, insights, and measurable impact.

So here are 10 GitHub projects that are worth studying, rebuilding, and customizing for your own resume. These projects cover different analyst skills like Excel, SQL, Python, Tableau, Power BI, finance analytics, HR analytics, customer behavior, data warehousing, and dashboard storytelling.

1. Pizza Sales Analysis Project

GitHub Link - https://github.com/ekaterinakham/SQL-Tableau-PowerBI-Excel-Pizza-Sales-Analysis-Project

Skills Covered - SQL, Excel, Power BI, Tableau, KPI dashboarding, sales analysis, revenue analysis

Why it’s strong for a resume -
This is one of the best all-in-one beginner-to-intermediate projects because it covers SQL queries, Excel dashboarding, Power BI dashboarding, Tableau dashboarding, sales KPIs, best/worst sellers, revenue trends, and business performance analysis. It shows that you can analyze the same business problem across multiple tools.

Resume points -
i. Built an end-to-end pizza sales analytics project using SQL, Excel, Power BI, and Tableau to analyze revenue, order volume, product performance, and customer demand trends.
ii. Created interactive dashboards tracking 10+ KPIs including total revenue, average order value, total pizzas sold, daily trends, monthly trends, and category-level performance.
iii. Wrote SQL queries to identify top-selling and low-performing pizza categories, enabling data-driven recommendations for menu optimization and sales strategy.
iv. Designed multi-tool dashboard reports that reduced manual sales review effort by organizing key business metrics into one clear reporting workflow.

2. HR Analytics Project

GitHub Link - https://github.com/ekaterinakham/PowerBI-Tableau-SQL-Excel-HR-Analytics-Project

Skills Covered - HR analytics, SQL, Excel, Power BI, Tableau, workforce reporting, attrition analysis

Why it’s strong for a resume -
This is a great project for people analytics and HR reporting roles. It covers employee metrics, attrition-style workforce analysis, Excel reporting, SQL documentation, Power BI dashboards, and Tableau dashboards. It is especially strong because HR analytics is used in almost every company.

Resume points -
i. Developed an HR analytics dashboard using SQL, Excel, Power BI, and Tableau to monitor employee count, workforce distribution, attrition patterns, and department-level trends.
ii. Analyzed employee data across multiple dimensions such as gender, department, job role, education field, and age group to identify workforce risk areas.
iii. Built 8+ HR KPIs and visual reports to help stakeholders understand attrition drivers, employee demographics, and retention opportunities.
iv. Transformed raw HR data into executive-ready dashboards, improving visibility into workforce health and supporting data-backed HR decisions.

3. Bank Loan Lending Data Analytics

GitHub Link - https://github.com/arnavchaturvedi17/Data-Analysis-Bank-Loan-Lending_Data_Analytics

Skills Covered - Finance analytics, SQL, Tableau, Excel, loan analysis, KPI tracking, risk reporting

Why it’s strong for a resume -
This is a strong finance-domain project. It includes loan applications, funded amount, amount received, interest rate, loan status, loan purpose, SQL ETL, Tableau dashboarding, and Excel validation. This is excellent for data analyst, financial analyst, risk analyst, and business analyst resumes.

Resume points -
i. Analyzed bank loan lending data using SQL, Excel, and Tableau to evaluate loan applications, funded amounts, repayment performance, and borrower risk segments.
ii. Built financial KPI dashboards tracking 10+ metrics including total loan applications, funded amount, amount received, interest rate, debt-to-income ratio, and loan status.
iii. Used SQL to segment good loans versus bad loans and identify patterns across loan purpose, term, grade, employment length, and borrower profile.
iv. Created Tableau dashboards to support lending performance review and risk monitoring, helping translate raw loan records into clear business insights.

4. OLA Data Analyst Project

GitHub Link - https://github.com/PrajwalGpy/OLA-Data-Analyst-Project-Power-BI-And-SQL

Skills Covered - SQL, Power BI, ride-booking analytics, customer analytics, driver performance, revenue reporting

Why it’s strong for a resume -
This is a solid business analytics project because it analyzes ride volumes, booking status, revenue by payment method, customer behavior, driver ratings, vehicle performance, and cancellation trends. It feels like a real analytics project from a marketplace or transportation company.

Resume points -
i. Built an OLA ride-booking analytics project using SQL and Power BI to analyze booking volume, revenue trends, cancellation patterns, and ride completion performance.
ii. Created dashboard views for 10+ operational KPIs including total bookings, successful rides, cancelled rides, revenue by payment method, vehicle type performance, and customer ratings.
iii. Used SQL queries to identify customer and driver behavior trends, including cancellation reasons, ride frequency, and revenue contribution by ride category.
iv. Delivered Power BI insights to support marketplace operations, customer experience improvement, and driver performance monitoring.

5. Customer Shopping Behavior Analytics

GitHub Link - https://github.com/amlanmohanty1/customer-trends-data-analysis-SQL-Python-PowerBI

Skills Covered - Python, SQL, Power BI, customer analytics, EDA, reporting, business presentation

Why it’s strong for a resume -
This is one of the best end-to-end projects because it includes data import, exploratory analysis, cleaning, SQL loading, business question analysis, Power BI dashboarding, and reporting. It shows a full analyst workflow rather than only a dashboard.

Resume points -
i. Completed an end-to-end customer shopping behavior analysis using Python, SQL, and Power BI to uncover purchasing patterns, customer segments, and product trends.
ii. Cleaned and prepared customer transaction data using Python, improving dataset consistency before loading structured tables into SQL for analysis.
iii. Answered 15+ business questions using SQL, including sales trends, customer demographics, purchase frequency, product preferences, and revenue drivers.
iv. Built a Power BI dashboard and final business report to summarize key insights, helping connect customer behavior patterns with actionable retail recommendations.

6. Cyclistic Bike Share Case Study

GitHub Link - https://github.com/SomiaNasir/Google-Data-Analytics-Capstone-Cyclistic-Case-Study

Skills Covered - SQL, BigQuery, Tableau, business case study, customer behavior analysis, data storytelling

Why it’s strong for a resume -
This project is strong because it follows a structured analyst process: Ask, Prepare, Process, Analyze, Share, and Act. It also includes SQL queries and Tableau visualizations. It is a good project for entry-level data analyst resumes because it shows business thinking, not just technical skills.

Resume points -
i. Conducted a Cyclistic bike-share case study using SQL and Tableau to compare usage behavior between casual riders and annual members.
ii. Processed and analyzed 12 months of trip data to identify patterns in ride duration, weekday usage, seasonal demand, and customer segment behavior.
iii. Built Tableau dashboards to visualize member conversion opportunities, peak usage periods, and differences in riding habits across customer groups.
iv. Recommended data-backed marketing strategies to increase annual memberships by targeting high-frequency casual riders and weekend-heavy users.

7. SQL Data Warehouse and Analytics Project

GitHub Link - https://github.com/DataWithBaraa/sql-data-warehouse-project

Skills Covered - SQL Server, ETL, data warehousing, star schema, data modeling, reporting, business analytics

Why it’s strong for a resume -
This project can make a resume stand out because it goes beyond regular dashboarding. It covers data warehouse architecture, ETL, bronze/silver/gold layers, fact and dimension modeling, data quality checks, and SQL-based reporting. This is especially useful for data analyst, BI analyst, and analytics engineer roles.

Resume points -
i. Designed a SQL-based data warehouse using bronze, silver, and gold layers to transform raw sales data into structured analytics-ready tables.
ii. Built ETL workflows and data quality checks to clean, standardize, and validate customer, product, and sales datasets before reporting.
iii. Created fact and dimension tables using star schema modeling to support scalable reporting across customer behavior, product performance, and sales trends.
iv. Developed SQL analytics reports covering revenue trends, customer segmentation, product performance, and business growth metrics for BI use cases.

8. Data Analysis Portfolio by Rebekah

GitHub Link - https://github.com/rebekah999/Data-Analysis-Portfolio

Skills Covered - PostgreSQL, Excel, Python, EDA, sales analysis, inventory analysis, churn analysis

Why it’s strong for a resume -
This is a strong reference portfolio because it includes multiple analyst-style projects. It covers SQL analysis, Excel exploration, property sales dashboards, S&P 500 data pipeline work, and employee churn analysis. It is helpful if you want to understand how to organize several projects in one GitHub portfolio.

Resume points -
i. Built a multi-project data analysis portfolio covering SQL, Excel, Python, sales analytics, inventory analysis, employee churn, and financial market data.
ii. Used PostgreSQL to analyze business datasets across orders, revenue, customers, inventory, and employee performance, answering 20+ analytical questions.
iii. Created Excel and dashboard-based reports to summarize sales trends, product performance, customer behavior, and operational efficiency.
iv. Organized multiple analysis projects into a clean GitHub portfolio structure, improving project readability for recruiters and hiring managers.

9. Maven Toys Sales Project Analysis

GitHub Link - https://github.com/Yash-Yennewar/Maven_Toys_Sales_Project_Analysis

Skills Covered - Power BI, DAX, data modeling, retail analytics, inventory analysis, sales performance

Why it’s strong for a resume -
This is a good Power BI portfolio project. It uses a realistic retail dataset and focuses on revenue, profit, inventory efficiency, store performance, DAX calculations, relationships, maps, slicers, and business storytelling. This is a strong choice for anyone targeting BI analyst or Power BI analyst roles.

Resume points -
i. Developed a Power BI sales analytics dashboard for Maven Toys to monitor revenue, profit, store performance, product demand, and inventory efficiency.
ii. Built DAX measures and data model relationships to calculate 10+ business KPIs including total sales, profit margin, units sold, stock levels, and store-level performance.
iii. Analyzed product and location-level trends to identify high-performing stores, slow-moving products, and inventory optimization opportunities.
iv. Designed an interactive retail dashboard with slicers, maps, and category-level drilldowns to support faster business performance review.

10. Alex The Analyst Portfolio Projects

GitHub Link - https://github.com/AlexTheAnalyst/PortfolioProjects

Skills Covered - SQL, Python, data cleaning, Tableau, web scraping, API extraction, EDA

Why it’s strong for a resume -
This repo is popular and useful for learning project structure. It includes SQL exploration, Nashville housing data cleaning, Tableau SQL queries, Python notebooks, web scraping, and API extraction. Use it as a reference, but customize your own version because many candidates already use this repo.

Resume points -
i. Completed multiple portfolio projects using SQL, Python, Tableau, web scraping, and API extraction to demonstrate end-to-end data analysis skills.
ii. Cleaned and transformed real-world datasets using SQL, including handling missing values, standardizing fields, removing duplicates, and preparing data for visualization.
iii. Performed exploratory analysis using SQL and Python to identify trends, patterns, and business insights across housing, COVID, and public datasets.
iv. Built Tableau-ready datasets and dashboards to communicate findings clearly through visual storytelling and stakeholder-friendly reporting.

Ending Note

A strong data analyst resume does not need 20 projects.

It needs 3 solid projects that show:

SQL thinking,
dashboard storytelling,
business understanding,
clean data work,
and measurable insights.

My recommendation:

Build one project with SQL + Python,
one project with Power BI or Tableau,
and one project from a real business domain like finance, HR, retail, healthcare, or customer analytics.

Don’t just copy these GitHub projects. Rebuild them, improve the dashboards, add your own insights, and write stronger resume bullets around the business impact.

Netflix Data Science Interview Questions: 18 Problems That Test How You Think

Karthik Adari — Wed, 29 Apr 2026 15:54:41 GMT

Most analytics and data science interviews are not just testing formulas.

They are testing whether you can take a messy business problem, slow it down, structure it, make reasonable assumptions, and explain your thinking without getting lost.

Here are detailed, human-friendly answers to 18 common interview questions across fraud, A/B testing, regression, SQL, product analytics, experimentation, and Netflix-style business cases.

① Given a month’s worth of login data with account_id, device_id, and payment-related metadata, how would you detect fraud?

I would not start by building a model immediately.

I would first ask: What type of fraud are we trying to catch?

For example:

Account takeover
Fake accounts
Shared or resold accounts
Payment abuse
Stolen card usage
Promo abuse
Refund abuse
Bot-driven login behavior

Then I would structure the data around three main entities:

Account level

How many devices used the same account?
How many payment methods were attached?
How many failed payments happened?
Did login location or device suddenly change?
Was there a sudden spike in activity?

Device level

How many accounts used the same device?
Did one device create or access many accounts?
Does the device switch between many payment methods?
Is the device linked to accounts with failed payments or chargebacks?

Payment level

Are multiple accounts using the same card?
Are there many failed payments before success?
Are billing country and login country very different?
Are disposable cards or suspicious BIN patterns involved?
Is the same payment method reused across unrelated accounts?

A few useful features:

accounts_per_device
devices_per_account
payment_methods_per_account
accounts_per_payment_method
failed_payment_count
chargeback_count
login_country_count
new_device_login_flag
velocity_of_logins
time_between_account_creation_and_payment
payment_failure_rate

I would also use graph thinking.

For example:

Account A uses Device X
Device X also uses Account B, C, D, and E
Account B and C both used the same payment method
Account D had a chargeback

That cluster is more suspicious than looking at one login row alone.
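The cluster idea above can be sketched with a small union-find over shared devices and payment methods. This is only an illustration under assumptions: the edge list, the `card_1` identifier, the extra `acct_F` account, and the helper names `find` and `union` are all hypothetical and not part of the original data.

```python
from collections import defaultdict

# Hypothetical edges: each account is linked to the devices and payment methods it used.
edges = [
    ("acct_A", "device_X"),
    ("acct_B", "device_X"),
    ("acct_C", "device_X"),
    ("acct_D", "device_X"),
    ("acct_E", "device_X"),
    ("acct_B", "card_1"),
    ("acct_C", "card_1"),
    ("acct_F", "device_Y"),  # unrelated account on its own device
]

parent = {}

def find(node):
    """Return the representative of the node's cluster (with path compression)."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node

def union(a, b):
    """Merge the clusters that contain a and b."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a

for account, entity in edges:
    union(account, entity)

# Group accounts by their cluster representative.
clusters = defaultdict(set)
for node in list(parent):
    if node.startswith("acct_"):
        clusters[find(node)].add(node)

cluster_sizes = sorted(len(members) for members in clusters.values())
print(cluster_sizes)
```

Accounts A through E end up in one cluster because they share Device X, while the unrelated account stays alone. In practice, a chargeback on any account in a large cluster could raise the risk score of the whole cluster.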

A simple fraud score could look like this:

WITH device_usage AS (
    SELECT
        device_id,
        COUNT(DISTINCT account_id) AS accounts_per_device
    FROM logins
    GROUP BY device_id
),

account_usage AS (
    SELECT
        account_id,
        COUNT(DISTINCT device_id) AS devices_per_account
    FROM logins
    GROUP BY account_id
),

payment_usage AS (
    SELECT
        payment_method_id,
        COUNT(DISTINCT account_id) AS accounts_per_payment
    FROM payments
    GROUP BY payment_method_id
),

account_payment AS (
    SELECT
        account_id,
        COUNT(*) AS total_payments,
        SUM(CASE WHEN payment_status = 'failed' THEN 1 ELSE 0 END) AS failed_payments,
        SUM(CASE WHEN payment_status = 'chargeback' THEN 1 ELSE 0 END) AS chargebacks
    FROM payments
    GROUP BY account_id
)

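-- Combine the signals into one row per account.
-- Thresholds (5, 4, 3) are illustrative and should be tuned on real data.
SELECT
    l.account_id,
    MAX(d.accounts_per_device) AS max_accounts_per_device,
    a.devices_per_account,
    MAX(pu.accounts_per_payment) AS max_accounts_per_payment,
    COALESCE(ap.failed_payments, 0) AS failed_payments,
    COALESCE(ap.chargebacks, 0) AS chargebacks,
    CASE
        WHEN MAX(d.accounts_per_device) >= 5 THEN 1 ELSE 0
    END AS shared_device_flag,
    CASE
        WHEN a.devices_per_account >= 4 THEN 1 ELSE 0
    END AS many_devices_flag,
    CASE
        WHEN ap.failed_payments >= 3 THEN 1 ELSE 0
    END AS payment_failure_flag,
    -- Simple additive score: one point per triggered flag.
    (CASE WHEN MAX(d.accounts_per_device) >= 5 THEN 1 ELSE 0 END
     + CASE WHEN a.devices_per_account >= 4 THEN 1 ELSE 0 END
     + CASE WHEN ap.failed_payments >= 3 THEN 1 ELSE 0 END) AS fraud_score
-- De-duplicate before joining so logins and payments do not multiply rows.
FROM (SELECT DISTINCT account_id, device_id FROM logins) l
LEFT JOIN device_usage d
    ON l.device_id = d.device_id
LEFT JOIN account_usage a
    ON l.account_id = a.account_id
LEFT JOIN (SELECT DISTINCT account_id, payment_method_id FROM payments) p
    ON l.account_id = p.account_id
LEFT JOIN payment_usage pu
    ON p.payment_method_id = pu.payment_method_id
LEFT JOIN account_payment ap
    ON l.account_id = ap.account_id
GROUP BY
    l.account_id,
    a.devices_per_account,
    ap.failed_payments,
    ap.chargebacks;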

After that, I would decide whether to use:

Rule-based detection for obvious fraud
Anomaly detection for unknown patterns
Supervised ML if we have labels such as confirmed fraud, chargebacks, or banned accounts
Graph-based detection if fraud happens in connected groups

Most importantly, I would measure precision and recall.

In fraud, catching everything sounds good, but false positives can hurt real customers. So I would not only ask, “How much fraud did we catch?” I would also ask, “How many good users did we block?”
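To make the two questions concrete, here is a minimal sketch of precision and recall computed from sets of account IDs. The account IDs and the helper name `precision_recall` are made up for illustration.

```python
def precision_recall(predicted_fraud, actual_fraud):
    """Compute precision and recall from two sets of account IDs."""
    true_positives = len(predicted_fraud & actual_fraud)
    precision = true_positives / len(predicted_fraud) if predicted_fraud else 0.0
    recall = true_positives / len(actual_fraud) if actual_fraud else 0.0
    return precision, recall

# Hypothetical example: 4 accounts flagged, 5 accounts later confirmed as fraud.
flagged = {"a1", "a2", "a3", "a4"}
confirmed = {"a2", "a3", "a4", "a7", "a9"}

precision, recall = precision_recall(flagged, confirmed)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here precision answers "of the accounts we blocked, how many were really fraud?" and recall answers "of all the fraud, how much did we catch?" The flagged account `a1` is the good user who got blocked.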

② What are the assumptions of A/B testing?

A/B testing looks simple, but it depends on many assumptions.

The main assumptions are:

1. Random assignment

Users should be randomly assigned to control and treatment. If one group gets more new users or more high-value users, the test becomes biased.

2. Independence

One user’s treatment should not affect another user’s outcome. This is also called no interference.

For example, if testing a social feature, one user seeing the feature may affect their friends. That breaks independence.

3. Stable experience

Control users should actually see control. Treatment users should actually see treatment. Bugs, logging issues, or delayed feature exposure can ruin the test.

4. Same measurement logic

Metrics should be calculated the same way for both groups.

If revenue is logged differently in treatment, the result may look significant even if the product did not improve.

5. No sample ratio mismatch

If the test is supposed to be 50 percent control and 50 percent treatment, the actual split should be close to that. If it becomes 70 and 30, something may be wrong.

6. Enough time to capture behavior

Some metrics need time.

A button click may show up quickly. Retention, refunds, churn, or subscription renewal needs more time.

7. Metrics should be decided before the test

If we keep checking 30 metrics and only report the one that looks good, we may fool ourselves.

8. No repeated peeking without correction

Checking results every hour and stopping when the p-value looks good increases false positives.

9. The test population should match the decision

If we only test on new users, we should be careful before applying the result to all users.

A/B testing is not just about splitting traffic. It is about making sure the comparison is fair.
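The sample ratio mismatch check from point 5 can be sketched as a chi-square goodness-of-fit test. This is a minimal version using only the standard library; the traffic counts and the function name `srm_p_value` are assumptions for illustration, and the alert threshold you use is a team decision.

```python
import math

def srm_p_value(control_n, treatment_n, expected_control_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for the traffic split."""
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, the chi-square tail probability equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts for a test designed as a 50/50 split.
healthy = srm_p_value(50_120, 49_880)   # small imbalance, plausible by chance
broken = srm_p_value(51_500, 48_500)    # imbalance too large to be chance

print(f"healthy split p-value: {healthy:.3f}")
print(f"broken split p-value:  {broken:.2e}")
```

A very small p-value here does not say which group is wrong. It only says the assignment or logging should be investigated before trusting the experiment result.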

③ If you have one day of experiment data, large sample size, and significant results, would you stop the experiment?

Usually, no.

A large sample size and a significant result after one day are not enough.

I would first check:

Was the experiment planned to run only one day?
Is the metric immediate or delayed?
Are there weekday or weekend effects?
Is there a novelty effect?
Is there a sample ratio mismatch?
Are guardrail metrics healthy?
Did logging work correctly?
Are new and returning users reacting differently?
Did the result hold across important segments?
Did we account for repeated peeking?

One day can be misleading.

For example, a new homepage design may increase clicks on day one because users are curious. But after a week, the effect may disappear.

Or a pricing experiment may look good after one day because revenue per visitor increased, but refunds or cancellations may show up later.

I would stop early only if:

The test had a pre-defined early stopping rule
There is a strong harm signal
There is a clear business reason to stop
Sequential testing methods were used
The effect is extremely large and stable across checks

My interview answer would be:

“I would not stop just because the p-value is significant after one day. I would check the experiment design, guardrail metrics, sample ratio, novelty effects, and whether the metric needs more time. Unless early stopping was planned, I would continue until the pre-defined duration or decision rule is reached.”
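The risk of stopping at the first significant p-value can be illustrated with a small simulation of A/A tests, where there is no true effect. This is a simplified sketch: it assumes equal-sized batches, a known-variance z-test, 20 interim looks, and a fixed random seed, all chosen only for illustration.

```python
import random

random.seed(7)

def run_aa_test(num_looks):
    """Simulate one A/A test (no true effect) with num_looks interim checks.

    Returns (significant_at_any_look, significant_at_final_look).
    """
    running_sum = 0.0
    any_significant = False
    z = 0.0
    for look in range(1, num_looks + 1):
        # Each look adds one batch; its standardized effect is N(0, 1) under the null.
        running_sum += random.gauss(0.0, 1.0)
        z = running_sum / look ** 0.5
        if abs(z) > 1.96:
            any_significant = True
    return any_significant, abs(z) > 1.96

num_simulations = 5000
results = [run_aa_test(num_looks=20) for _ in range(num_simulations)]
peeking_rate = sum(r[0] for r in results) / num_simulations
final_only_rate = sum(r[1] for r in results) / num_simulations

print(f"false positive rate, stop at first significant look: {peeking_rate:.3f}")
print(f"false positive rate, test once at the end:           {final_only_rate:.3f}")
```

Testing once at the end keeps the false positive rate near the nominal 5 percent, while stopping at the first significant look pushes it well above that in this setup. That gap is why early stopping needs a pre-defined rule or a sequential testing method.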

④ How do you know if one algorithm is better than another?

An algorithm is not “better” in general. It is better for a specific goal.

I would compare algorithms using four layers.

1. Offline model performance

For classification:

Accuracy
Precision
Recall
F1 score
AUC
Log loss
Calibration

For regression:

MAE
RMSE
MAPE
R²
Residual behavior

But the metric should match the business problem.

For fraud detection, recall may matter more because missing fraud is expensive.

For spam detection, precision may matter more because blocking real messages is painful.

2. Validation setup

I would make sure both algorithms are trained and tested fairly.

That means:

Same train and test split
No data leakage
Same features
Same evaluation period
Cross-validation if needed
Time-based split for time-sensitive data

3. Business impact

A model can have better AUC but worse business value.

For example, one model may improve fraud detection by 2 percent but create too many false positives. Another may have slightly lower AUC but better customer experience.

So I would ask:

Does it improve revenue?
Does it reduce risk?
Does it reduce manual review?
Does it improve user experience?
Does it meet latency requirements?

4. Practical constraints

I would also compare:

Training time
Prediction speed
Interpretability
Maintenance cost
Stability over time
Fairness across user groups
Ease of deployment

The best answer is not always the fanciest model.

Sometimes logistic regression is better than a deep model because it is fast, stable, explainable, and good enough.

⑤ How do you interpret R² in regression?

R² tells us how much of the variation in the target variable is explained by the model.

For example, if R² is 0.72, I would say:

“The model explains 72 percent of the variation in the outcome using the features included in the model.”

It does not mean:

The model is 72 percent accurate
The model is correct 72 percent of the time
The features cause 72 percent of the outcome
The model will perform well on new data automatically

A high R² can still be bad if:

There is data leakage
The model overfits
Important assumptions are broken
Residuals show patterns
The model performs poorly on new data

A low R² is not always useless either.

In human behavior, marketing, finance, and social data, outcomes are noisy. A lower R² can still support useful decisions if the model improves forecasting or ranking.

I would also look at:

Adjusted R²
RMSE or MAE
Residual plots
Out-of-sample performance
Whether the model makes business sense

Simple interview line:

“R² measures explained variance, not accuracy or causality. I would use it with error metrics and validation performance before trusting the model.”
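The explained-variance reading can be made concrete with a short calculation, using the standard definition of R² as one minus the ratio of residual to total sum of squares. The numbers below are made up for illustration.

```python
def r_squared(actual, predicted):
    """R² = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical outcomes and model predictions.
actual = [10.0, 12.0, 15.0, 19.0, 24.0]
predicted = [11.0, 12.0, 14.0, 20.0, 23.0]

model_r2 = r_squared(actual, predicted)
# Predicting the mean for every row explains none of the variance.
baseline_r2 = r_squared(actual, [sum(actual) / len(actual)] * len(actual))

print(f"model R²: {model_r2:.3f}, mean-only baseline R²: {baseline_r2:.3f}")
```

The comparison with the mean-only baseline is the useful intuition: R² measures how much better the model does than always guessing the average. It says nothing about how the model behaves on data it has not seen.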

⑥ What is FDR? What are the pitfalls in multiple testing?

FDR means False Discovery Rate.

It is the expected proportion of false positives among the results we call significant.

Example:

If we test 100 features and declare 20 significant, an FDR of 10 percent means we expect around 2 of those 20 discoveries to be false positives.

This matters because when we run many tests, some will look significant just by chance.

If we use p < 0.05 and run 100 independent tests, we may get around 5 false positives even if nothing real is happening.

Common pitfalls:

1. Testing too many metrics

If we check revenue, clicks, watch time, retention, churn, refund rate, search usage, and 50 segments, something will look significant.

2. Looking at results repeatedly

If we keep checking daily and stop when p < 0.05, we increase the chance of a false positive.

3. Changing hypotheses after seeing results

This is common in real life. People look at the data, find a nice story, then act like that was the original hypothesis.

4. Segment fishing

The overall result may be neutral, but one tiny segment looks amazing. That could be noise.

5. Ignoring dependency between tests

Metrics are often related. Clicks, sessions, engagement, and retention are not fully independent.

Ways to handle it:

Pre-define primary and secondary metrics
Use Benjamini-Hochberg correction for FDR
Use Bonferroni when you want stricter control
Avoid making big decisions from small segments
Validate surprising findings in a follow-up experiment
Separate exploratory analysis from confirmatory analysis
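As a rough sketch of the Benjamini-Hochberg correction mentioned above: sort the p-values, find the largest rank k whose p-value is at most (k / m) times the target FDR, and reject everything up to that rank. The p-values below are invented for illustration, and `benjamini_hochberg` is a hand-written helper, not a library function.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * fdr.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    # Reject every hypothesis ranked at or below max_rank.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected

# Hypothetical p-values from 10 metrics in one experiment.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.063, 0.074, 0.205, 0.212, 0.216]

bh_rejected = benjamini_hochberg(p_values, fdr=0.05)
naive_rejected = [p < 0.05 for p in p_values]

print(f"significant with naive p < 0.05:  {sum(naive_rejected)}")
print(f"significant after BH at 5% FDR:   {sum(bh_rejected)}")
```

In this example, five metrics pass the naive threshold but only two survive the correction. The three that drop out are exactly the borderline results that are most likely to be noise.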

Simple interview answer:

“FDR is the expected proportion of false discoveries among significant findings, and controlling it keeps that proportion low. The main pitfall in multiple testing is that the more tests we run, the more likely we are to find something significant by luck.”

⑦ Explain regression coefficients, R², Type I error, and Type II error.

Let’s take them one by one.

Regression coefficients

A regression coefficient tells us the expected change in the target variable when one feature increases by one unit, holding other variables constant.

Example:

If we predict monthly spend and the coefficient for sessions_per_month is 3.5, then one extra session is associated with 3.5 more dollars in monthly spend, assuming other variables stay the same.

For a binary variable:

If premium_user = 1 has a coefficient of 20, premium users spend 20 dollars more on average than non-premium users, holding other features constant.

Important note:

Regression coefficients show association, not automatically causation.
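The binary-variable case can be checked with a tiny example. This sketch simplifies the setting to a single feature, so there are no other variables to hold constant; the spend values are invented, and `simple_ols` is a hand-written helper.

```python
def simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares (one feature)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: premium_user flag (0/1) and monthly spend in dollars.
premium_user = [0, 0, 0, 1, 1, 1]
monthly_spend = [30.0, 40.0, 50.0, 55.0, 60.0, 65.0]

intercept, slope = simple_ols(premium_user, monthly_spend)
print(f"intercept={intercept:.1f}, coefficient for premium_user={slope:.1f}")
```

With one 0/1 feature, the intercept is the average spend of non-premium users and the coefficient is the difference between the two group averages. That is why a coefficient of 20 reads as "premium users spend 20 dollars more on average."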

R²

R² tells us how much variation in the target variable is explained by the model.

If R² is 0.60, the model explains 60 percent of the variation in the outcome.

It does not mean the model is 60 percent accurate.

Type I error

Type I error means false positive.

We say there is an effect when there is actually no effect.

Example:

We say a new checkout page increases purchases, but in reality, it does not.

Type II error

Type II error means false negative.

We fail to detect an effect even though there is a real effect.

Example:

A new recommendation model actually improves retention, but our test does not detect it because the sample size was too small.

Simple memory:

Type I error: false alarm
Type II error: missed signal

In business, both matter.

A Type I error can make us launch a bad feature.

A Type II error can make us reject a good feature.

⑧ Explain ETL considerations for Big Data.

For Big Data ETL, I would think beyond just moving data from one place to another.

I would consider the full lifecycle: ingestion, transformation, storage, quality, cost, security, and monitoring.

1. Data volume

How much data are we processing?

Millions of rows?
Billions of rows?
Terabytes per day?
Streaming or batch?

This affects whether we use tools like Spark, Flink, BigQuery, Snowflake, Databricks, Kafka, or cloud storage.

2. Data velocity

Is the data coming in real time or once per day?

For example:

Login events may need near real-time processing
Finance reports may be fine as daily batch jobs

3. Data variety

Data can come from many formats:

CSV
JSON
Parquet
Avro
Logs
APIs
Database tables
Event streams

The ETL design should handle schema changes and messy fields.

4. Partitioning

Good partitioning improves speed and lowers cost.

Common partitions:

Date
Region
Product
Event type

Bad partitioning can create slow queries or too many small files.

5. Incremental processing

We should avoid reprocessing everything daily if only new records changed.

Useful approaches:

Incremental loads
Change Data Capture
Watermarks
Last updated timestamps
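A watermark-based incremental load can be sketched in a few lines. This is a toy in-memory version; the row structure, the `updated_at` field name, and the function `incremental_batch` are assumptions for illustration, not a specific tool's API.

```python
from datetime import datetime

def incremental_batch(rows, watermark):
    """Return rows updated after the watermark, plus the new watermark."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    # If nothing is new, keep the old watermark unchanged.
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

# Hypothetical source table with a last-updated timestamp per row.
source_rows = [
    {"id": 1, "updated_at": datetime(2026, 3, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2026, 3, 2, 9, 0)},
    {"id": 3, "updated_at": datetime(2026, 3, 3, 9, 0)},
]

last_watermark = datetime(2026, 3, 1, 23, 59)
new_rows, last_watermark = incremental_batch(source_rows, last_watermark)
print([r["id"] for r in new_rows], last_watermark)
```

Only rows 2 and 3 are processed, and the watermark advances to the latest timestamp seen. A real pipeline would also persist the watermark only after the load succeeds, so a failed run does not skip data.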

6. Late-arriving data

In Big Data systems, events do not always arrive on time.

For example, a purchase event may arrive today but belong to yesterday.

The pipeline should support late data correction.

7. Data quality checks

I would add checks for:

Null values
Duplicate rows
Invalid IDs
Negative revenue
Future timestamps
Broken joins
Unexpected row count changes
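A few of these checks can be sketched as a single pass over the rows. The column names (`order_id`, `user_id`, `revenue`, `event_time`) and the sample rows are hypothetical; in a real pipeline these checks would usually run as SQL or inside a data quality framework.

```python
from datetime import datetime

def quality_issues(rows, now):
    """Return a dict mapping each check name to the number of failing rows."""
    seen_ids = set()
    issues = {"null_user_id": 0, "duplicate_order_id": 0,
              "negative_revenue": 0, "future_timestamp": 0}
    for row in rows:
        if row["user_id"] is None:
            issues["null_user_id"] += 1
        if row["order_id"] in seen_ids:
            issues["duplicate_order_id"] += 1
        seen_ids.add(row["order_id"])
        if row["revenue"] < 0:
            issues["negative_revenue"] += 1
        if row["event_time"] > now:
            issues["future_timestamp"] += 1
    return issues

# Hypothetical batch with one example of each problem.
rows = [
    {"order_id": 1, "user_id": "u1", "revenue": 20.0, "event_time": datetime(2026, 3, 1)},
    {"order_id": 1, "user_id": "u1", "revenue": 20.0, "event_time": datetime(2026, 3, 1)},  # duplicate
    {"order_id": 2, "user_id": None, "revenue": 15.0, "event_time": datetime(2026, 3, 2)},  # null ID
    {"order_id": 3, "user_id": "u3", "revenue": -5.0, "event_time": datetime(2026, 3, 2)},  # negative
    {"order_id": 4, "user_id": "u4", "revenue": 9.0, "event_time": datetime(2027, 1, 1)},   # future
]

issues = quality_issues(rows, now=datetime(2026, 3, 3))
print(issues)
```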

8. Idempotency

If a job runs twice, it should not double-count data.

This is very important.

A pipeline should be safe to rerun.
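One common way to get this property is to write rows by a unique key instead of appending them. The sketch below uses a dictionary as a stand-in for the target table; the `event_id` key and the function `load_idempotent` are assumptions for illustration.

```python
def load_idempotent(target, batch):
    """Upsert rows keyed by event_id so reruns overwrite instead of appending."""
    for row in batch:
        target[row["event_id"]] = row
    return target

# Hypothetical batch of purchase events.
batch = [
    {"event_id": "e1", "amount": 10.0},
    {"event_id": "e2", "amount": 5.0},
]

warehouse = {}
load_idempotent(warehouse, batch)
load_idempotent(warehouse, batch)  # simulated rerun of the same job

total_amount = sum(row["amount"] for row in warehouse.values())
print(len(warehouse), total_amount)
```

After two runs the table still holds two rows and the total is unchanged. In a warehouse, the same idea typically shows up as a MERGE or upsert statement, or as overwriting a whole partition.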

9. Monitoring

I would monitor:

Job failures
Runtime
Row counts
Freshness
Data drift
Cost spikes
Schema changes

10. Security and privacy

Payment, user, and personal data need careful handling.

I would consider:

Encryption
Access control
PII masking
Audit logs
Retention policies

Simple interview answer:

“Big Data ETL is not just about extracting and loading data. I would think about scale, partitioning, incremental loads, late data, data quality, idempotency, monitoring, cost, and security.”

⑨ What is a p-value? What does 0.05 actually mean?

A p-value tells us how surprising the observed result is if the null hypothesis were true.

The null hypothesis usually says there is no effect.

So if p-value = 0.05, it means:

“If there were truly no effect, there is a 5 percent chance of seeing a result this extreme or more extreme due to random chance.”

It does not mean:

There is a 95 percent chance the treatment works
There is a 5 percent chance the result is false
The effect is important
The result is practically meaningful
The experiment was designed correctly

A tiny p-value can happen with a huge sample size even when the effect is very small.

For example, a button color change may increase click rate from 10.00 percent to 10.03 percent. With millions of users, that may be statistically significant, but maybe not worth launching.
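The button example can be worked through with a pooled two-proportion z-test. This is a sketch under assumptions: the sample sizes (1 million and 20 million users per group) are chosen only to show how the same 0.03 point lift moves from not significant to significant as the sample grows.

```python
import math

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for a difference in proportions (pooled z-test)."""
    rate_a, rate_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

# Same click rates (10.00 percent vs 10.03 percent) at two assumed sample sizes.
small_p = two_proportion_p_value(100_000, 1_000_000, 100_300, 1_000_000)
large_p = two_proportion_p_value(2_000_000, 20_000_000, 2_006_000, 20_000_000)

print(f"1M users per group:  p = {small_p:.3f}")
print(f"20M users per group: p = {large_p:.4f}")
```

The lift is identical in both cases. Only the sample size changed, which is why the p-value alone cannot tell you whether the change is worth launching.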

I would always look at:

Effect size
Confidence interval
Business impact
Experiment design
Guardrail metrics
Whether the test was pre-planned
Whether multiple testing was handled

Simple interview answer:

“A p-value of 0.05 means the observed result would be rare under the no-effect assumption. It does not prove the treatment works. I would also look at effect size, confidence intervals, and business value.”

⑩ Solve a SQL problem with multiple constraints, joins, and averages.

Let’s use a realistic example.

Problem

You have three tables:

users

user_id
signup_date
country

orders

order_id
user_id
order_date
order_amount
status

sessions

session_id
user_id
session_date
device_type

Question:

Find the average completed order amount per active user in March 2026, by country, only for countries with at least 100 active users.

An active user is a user with at least one session in March 2026.

Only include completed orders from March 2026.

SQL

WITH march_active_users AS (
    SELECT DISTINCT
        user_id
    FROM sessions
    WHERE session_date >= DATE '2026-03-01'
      AND session_date < DATE '2026-04-01'
),

march_completed_orders AS (
    SELECT
        user_id,
        SUM(order_amount) AS total_order_amount,
        COUNT(order_id) AS completed_order_count
    FROM orders
    WHERE order_date >= DATE '2026-03-01'
      AND order_date < DATE '2026-04-01'
      AND status = 'completed'
    GROUP BY user_id
),

user_level AS (
    SELECT
        u.user_id,
        u.country,
        COALESCE(o.total_order_amount, 0) AS total_order_amount,
        COALESCE(o.completed_order_count, 0) AS completed_order_count
    FROM users u
    JOIN march_active_users a
        ON u.user_id = a.user_id
    LEFT JOIN march_completed_orders o
        ON u.user_id = o.user_id
)

SELECT
    country,
    COUNT(DISTINCT user_id) AS active_users,
    SUM(total_order_amount) AS total_revenue,
    SUM(completed_order_count) AS completed_orders,
    AVG(total_order_amount) AS avg_order_amount_per_active_user
FROM user_level
GROUP BY country
HAVING COUNT(DISTINCT user_id) >= 100
ORDER BY avg_order_amount_per_active_user DESC;

How I would explain it

I first identify active users from the sessions table.

Then I calculate completed orders for March at the user level.

Then I join users to active users and left join orders, because active users with zero orders should still be included.

Finally, I aggregate by country and filter to countries with at least 100 active users.

The key detail is the left join. If I used an inner join to orders, I would only include users who purchased, which would inflate the average.
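The effect of the join choice can be checked on a tiny in-memory SQLite database. This is a sketch with made-up rows, a single country, and no 100-user threshold; dates are stored as ISO text because SQLite has no DATE type, and the helper name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER, signup_date TEXT, country TEXT);
CREATE TABLE sessions (session_id INTEGER, user_id INTEGER, session_date TEXT, device_type TEXT);
CREATE TABLE orders (order_id INTEGER, user_id INTEGER, order_date TEXT, order_amount REAL, status TEXT);

-- Three active users in one country; only user 1 has a completed March order.
INSERT INTO users VALUES (1, '2025-06-01', 'IN'), (2, '2025-07-01', 'IN'), (3, '2025-08-01', 'IN');
INSERT INTO sessions VALUES
    (1, 1, '2026-03-02', 'mobile'), (2, 1, '2026-03-10', 'mobile'),
    (3, 2, '2026-03-05', 'tv'), (4, 3, '2026-03-09', 'web');
INSERT INTO orders VALUES
    (1, 1, '2026-03-03', 30.0, 'completed'),
    (2, 2, '2026-03-06', 50.0, 'cancelled');
""")

def avg_order_amount_per_active_user(join_type):
    """Run the same query with either a LEFT or an INNER join to orders."""
    query = f"""
    WITH march_active_users AS (
        SELECT DISTINCT user_id FROM sessions
        WHERE session_date >= '2026-03-01' AND session_date < '2026-04-01'
    ),
    march_completed_orders AS (
        SELECT user_id, SUM(order_amount) AS total_order_amount
        FROM orders
        WHERE order_date >= '2026-03-01' AND order_date < '2026-04-01'
          AND status = 'completed'
        GROUP BY user_id
    )
    SELECT AVG(COALESCE(o.total_order_amount, 0))
    FROM users u
    JOIN march_active_users a ON u.user_id = a.user_id
    {join_type} JOIN march_completed_orders o ON u.user_id = o.user_id
    """
    return conn.execute(query).fetchone()[0]

left_avg = avg_order_amount_per_active_user("LEFT")
inner_avg = avg_order_amount_per_active_user("INNER")
print(f"LEFT JOIN average:  {left_avg}")
print(f"INNER JOIN average: {inner_avg}")
```

With the left join, the two active users without a completed order count as zero, so the average is taken over all three active users. The inner join silently drops them and reports the average over buyers only.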

⑪ Work through a product analytics case study.

Let’s say the case is:

“Netflix sees a drop in weekly watch time. How would you investigate?”

I would start by breaking the metric down.

Weekly watch time can drop because:

Fewer users are active
Active users are watching less
Sessions are shorter
Fewer sessions per user
Content discovery is worse
Playback issues increased
New content quality is weaker
Pricing or account changes affected behavior
A specific region or device is causing the drop

I would decompose it like this:

Total Watch Time =
Active Users
x Sessions per Active User
x Plays per Session
x Minutes Watched per Play
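Because the decomposition is a product, the log of the total change splits into a sum of log changes, one per factor. The sketch below uses that to estimate how much of a drop each factor explains. All numbers are hypothetical and the function names are illustrative.

```python
import math

# Hypothetical weekly values for each factor in the decomposition.
last_week = {
    "active_users": 1_000_000,
    "sessions_per_active_user": 5.0,
    "plays_per_session": 1.60,
    "minutes_per_play": 30.0,
}
this_week = {
    "active_users": 990_000,
    "sessions_per_active_user": 4.9,
    "plays_per_session": 1.55,
    "minutes_per_play": 27.0,
}

def total_watch_time(factors):
    """Total watch time is the product of all factors."""
    return math.prod(factors.values())

def contribution_shares(before, after):
    """Share of the total log change explained by each factor."""
    total_log_change = math.log(total_watch_time(after) / total_watch_time(before))
    return {
        name: math.log(after[name] / before[name]) / total_log_change
        for name in before
    }

shares = contribution_shares(last_week, this_week)
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {share:.1%} of the drop")
```

In this made-up data, minutes watched per play explains most of the decline, which would point the investigation toward playback quality or content engagement rather than user acquisition.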

Then I would check:

1. Is this a real drop or a data issue?

Did event logging change?
Did a pipeline fail?
Is watch time missing for some devices?
Did the definition of watch time change?

2. Where is the drop happening?

Break down by:

Country
Device type
New vs returning users
Plan type
App version
Content category
Acquisition channel

3. When did it start?

Check if the drop aligns with:

Product release
Pricing change
Content release schedule
Competitor event
Holiday
Sports event
App outage

4. Which part of the funnel changed?

For streaming:

App open rate
Homepage impressions
Title clicks
Play starts
Playback errors
Completion rate
Search usage
Recommendation CTR

5. What actions would I recommend?

If discovery dropped, I would look at recommendations and homepage ranking.

If playback errors increased, I would escalate to engineering.

If content engagement dropped, I would analyze catalog freshness and title-level performance.

If only new users dropped, I would inspect onboarding.

A strong product analytics answer always moves from metric to diagnosis to action.

⑫ Explain deeper A/B testing ideas like variance, bootstrap, covariate adjustment, and treatment effects.

Variance

Variance tells us how noisy a metric is.

High variance makes it harder to detect a real effect.

For example, revenue per user usually has high variance because a few users spend a lot and many spend nothing.

If variance is high, we may need:

Larger sample size
Longer experiment duration
Better metric design
Winsorization for extreme outliers
Covariate adjustment

Bootstrap

Bootstrap is a resampling method.

Instead of relying only on formulas, we repeatedly sample from the data with replacement and estimate the metric many times.

This gives us an empirical distribution of the metric.

Bootstrap is useful when:

The metric is not normally distributed
The formula for standard error is messy
The metric is a ratio
We want confidence intervals in a practical way

Example:

If we want confidence intervals for revenue per user, bootstrap can help because revenue data is often skewed.
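A minimal sketch of a percentile bootstrap for that case, using only the Python standard library. The revenue data is synthetic (about 80 percent of users spend nothing, the rest follow a skewed distribution), and the function name and parameters are assumptions for illustration.

```python
import random
import statistics

random.seed(42)

# Synthetic, skewed revenue per user: most users spend nothing, a few spend a lot.
revenue = [
    0.0 if random.random() < 0.8 else random.expovariate(1 / 50)
    for _ in range(500)
]

def bootstrap_mean_ci(values, n_resamples=1000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean."""
    n = len(values)
    # Resample users with replacement and record the mean each time.
    means = sorted(
        statistics.fmean(random.choices(values, k=n))
        for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper

ci_lower, ci_upper = bootstrap_mean_ci(revenue)
print(f"Mean revenue per user: {statistics.fmean(revenue):.2f}")
print(f"95% bootstrap CI: ({ci_lower:.2f}, {ci_upper:.2f})")
```

For an A/B test, the same idea applies to the difference between groups: resample each group separately, compute the difference in means each time, and take percentiles of those differences.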

Covariate adjustment

Covariate adjustment uses pre-experiment information to reduce noise.

For example, if we know each user’s watch time before the experiment, we can adjust for it.

This helps because users are naturally different.

Some users watch a lot. Some barely watch. If we control for past behavior, the treatment effect estimate can become more precise.

Common examples:

Pre-period revenue
Pre-period engagement
User tenure
Country
Device type
Plan type

One popular method is CUPED, which adjusts the outcome using pre-experiment behavior.
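Here is a rough sketch of the CUPED adjustment on synthetic data. It assumes one pre-experiment covariate per user and estimates the adjustment coefficient on all users together; treatment assignment is left out to keep the example short, and the function name is hypothetical.

```python
import random
import statistics

random.seed(7)

# Synthetic users: pre-period watch time (x) and in-experiment watch time (y).
pre_period = [random.gauss(100, 30) for _ in range(2000)]
outcome = [0.8 * x + random.gauss(0, 10) for x in pre_period]

def cuped_adjust(y, x):
    """Return y adjusted by the pre-experiment covariate x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    cov_xy = sum(
        (xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)
    ) / (len(x) - 1)
    theta = cov_xy / statistics.variance(x)
    # Remove the part of y that is predictable from pre-period behavior.
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]

adjusted = cuped_adjust(outcome, pre_period)

print(f"Variance before adjustment: {statistics.variance(outcome):.1f}")
print(f"Variance after adjustment:  {statistics.variance(adjusted):.1f}")
```

The adjusted metric keeps the same mean but has much lower variance when past and current behavior are strongly correlated. In a real test, the treatment effect would be the difference in adjusted means between the two groups.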

Treatment effects

The treatment effect is the difference in outcomes between the treatment and control groups.

Basic version:

Treatment Effect = Average outcome in treatment - Average outcome in control

But we may also care about:

Average Treatment Effect

The overall average impact across all users.

Heterogeneous Treatment Effect

The effect is different across groups.

For example:

New users benefit
Existing users do not
Mobile users benefit
TV users do not

Intent-to-treat effect

This compares users by the group they were assigned to, even if they did not fully experience the treatment.

This preserves randomization.

Treatment-on-treated effect

This measures the effect only among users who actually received or used the treatment.

This can be useful, but users who choose to use a feature are often different from those who do not, so a naive comparison breaks randomization and can be biased.
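One common way to get from the intent-to-treat effect to a treatment-on-treated estimate without that bias is to scale by the take-up rate. The sketch below assumes that nobody in control can access the treatment and that assignment affects the outcome only through actual usage. All numbers are hypothetical.

```python
# Hypothetical experiment summary.
control_mean = 10.0     # average outcome in the control group
treatment_mean = 10.6   # average outcome among ALL users assigned to treatment
take_up_rate = 0.30     # share of assigned users who actually used the feature

# Intent-to-treat: compare by assignment, which preserves randomization.
itt_effect = treatment_mean - control_mean

# Treatment-on-treated: the ITT effect is concentrated in the users who took it up.
tot_effect = itt_effect / take_up_rate

print(f"ITT effect: {itt_effect:.2f}")
print(f"TOT effect: {tot_effect:.2f}")
```

The estimate is only as good as those two assumptions. If control users can reach the feature, or if being assigned changes behavior on its own, the simple ratio no longer applies.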

⑬ Solve a live SQL + metrics case around a gift card program.

Problem

A company launched a gift card program.

Tables:

gift_cards

gift_card_id
buyer_user_id
recipient_user_id
purchase_date
gift_card_amount

redemptions

redemption_id
gift_card_id
redeem_date
redeemed_amount

orders

order_id
user_id
order_date
order_amount
payment_type

Questions:

What percent of gift cards are redeemed within 30 days?
What is the average redeemed amount?
Do recipients spend more than the gift card amount?

SQL

WITH gift_card_base AS (
    SELECT
        gift_card_id,
        buyer_user_id,
        recipient_user_id,
        purchase_date,
        gift_card_amount
    FROM gift_cards
),

redemption_summary AS (
    SELECT
        gift_card_id,
        MIN(redeem_date) AS first_redeem_date,
        SUM(redeemed_amount) AS total_redeemed_amount
    FROM redemptions
    GROUP BY gift_card_id
),

-- Caveat: if a recipient holds several gift cards, the same order is counted once per card.
recipient_orders_after_gift AS (
    SELECT
        g.gift_card_id,
        SUM(o.order_amount) AS recipient_total_spend_after_gift
    FROM gift_card_base g
    JOIN orders o
        ON g.recipient_user_id = o.user_id
       AND o.order_date >= g.purchase_date
       AND o.order_date < g.purchase_date + INTERVAL '30 days'
    GROUP BY g.gift_card_id
)

SELECT
    COUNT(*) AS total_gift_cards,

    AVG(
        CASE
            WHEN r.first_redeem_date IS NOT NULL
             AND r.first_redeem_date < g.purchase_date + INTERVAL '30 days'
            THEN 1.0 ELSE 0.0
        END
    ) AS redemption_rate_30d,

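    -- Averaged over all issued cards, with unredeemed cards counted as 0.
    -- Use AVG(r.total_redeemed_amount) for the average among redeemed cards only.
    AVG(COALESCE(r.total_redeemed_amount, 0)) AS avg_redeemed_amount,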

    AVG(COALESCE(o.recipient_total_spend_after_gift, 0)) AS avg_recipient_spend_30d,

    AVG(
        COALESCE(o.recipient_total_spend_after_gift, 0) - g.gift_card_amount
    ) AS avg_incremental_spend_above_gift_card

FROM gift_card_base g
LEFT JOIN redemption_summary r
    ON g.gift_card_id = r.gift_card_id
LEFT JOIN recipient_orders_after_gift o
    ON g.gift_card_id = o.gift_card_id;

Metrics I would track

For a gift card program, I would not only track sales.

I would track:

Gift card purchase volume
Redemption rate
Time to redemption
Breakage rate, meaning unused value
Recipient activation rate
New user recipients
Repeat purchase rate after redemption
Incremental spend above gift amount
Buyer repeat gift purchase rate
Fraud or abuse rate
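A few of these metrics can be sketched in plain Python. The records are made up, the field names are hypothetical, and breakage is defined here as the share of issued value that has not been redeemed.

```python
from datetime import date
from typing import NamedTuple, Optional

class GiftCard(NamedTuple):
    amount: float
    purchase_date: date
    first_redeem_date: Optional[date]
    redeemed_amount: float

# Hypothetical gift cards.
gift_cards = [
    GiftCard(50.0, date(2026, 1, 5), date(2026, 1, 12), 50.0),
    GiftCard(100.0, date(2026, 1, 10), date(2026, 3, 1), 60.0),
    GiftCard(25.0, date(2026, 1, 20), None, 0.0),
    GiftCard(25.0, date(2026, 2, 1), date(2026, 2, 3), 25.0),
]

redeemed = [card for card in gift_cards if card.first_redeem_date is not None]
days_to_redeem = [
    (card.first_redeem_date - card.purchase_date).days for card in redeemed
]

# Share of ALL issued cards first redeemed within 30 days of purchase.
redemption_rate_30d = sum(days < 30 for days in days_to_redeem) / len(gift_cards)

# Average time to first redemption, among redeemed cards only.
avg_days_to_redemption = sum(days_to_redeem) / len(redeemed)

# Breakage: share of issued value that was never redeemed.
issued_value = sum(card.amount for card in gift_cards)
redeemed_value = sum(card.redeemed_amount for card in gift_cards)
breakage_rate = 1 - redeemed_value / issued_value

print(f"30-day redemption rate: {redemption_rate_30d:.0%}")
print(f"Average days to redemption: {avg_days_to_redemption:.1f}")
print(f"Breakage rate: {breakage_rate:.1%}")
```

Note the different denominators: the redemption rate is over all issued cards, while time to redemption only makes sense for cards that were redeemed. Stating the denominator out loud is worth doing in the interview.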

The most important business question is:

“Is the gift card program creating incremental customer value, or just shifting existing spend into gift card form?”

That means we should compare recipients against a similar group of non-recipients, or run an experiment if possible.

⑭ Design an A/B test and metric framework for hiring linguists for subtitles.

Let’s say a streaming platform wants to improve subtitle quality by hiring more professional linguists.

The product question:

“Does using professional linguists for subtitles improve viewer experience and business outcomes?”

Hypothesis

Better subtitle quality will:

Increase completion rate
Increase watch time
Improve viewer satisfaction
Reduce subtitle-related complaints
Improve engagement with non-native language content

Experiment design

We can randomize content titles, regions, or users depending on the risk.

A clean setup:

Control: Existing subtitle process
Treatment: Subtitles created or reviewed by professional linguists

But we need to be careful.

Subtitle quality changes at the title level, and users talk to each other, so user-level randomization may be messy.

For subtitle quality, title-level or region-level testing may be better.

Primary metric

I would choose one primary metric based on the goal.

For example:

Completion rate for subtitle-enabled viewing sessions

Why completion rate?

Because if subtitles are better, users may be more likely to finish the content.

Secondary metrics

Watch time per subtitle-enabled session
Subtitle toggle-on rate
Rewatch rate
Thumbs up or rating
Search exits
Customer support complaints
Subtitle correction reports
Engagement with foreign-language titles

Guardrail metrics

Subtitle delivery time
Subtitle production cost
Content launch delay
Error rate
User complaints
Cancellation rate

Segments

I would analyze by:

Country
Language pair
Device
Content genre
New vs returning users
Native vs non-native language viewers
High subtitle usage users

Decision framework

I would launch if:

Completion rate improves
Complaints decrease
Cost increase is justified
No major delay in content availability
Results are consistent across important languages or regions

This case is good because it shows that experiment design is not just math. It also needs product judgment.

⑮ How would you value a piece of content?

I would value content based on the business value it creates over time.

For streaming, a piece of content can create value in several ways.

1. Acquisition value

Does the content bring in new subscribers?

For example, a popular show may convince people to sign up.

Metrics:

New subscriptions after release
Trial starts
Signup conversion
Marketing campaign attribution

2. Retention value

Does the content keep existing users from canceling?

This is often more important than acquisition.

Metrics:

Churn reduction
Renewal rate
Watch frequency
Return visits
Completion rate

3. Engagement value

Does the content increase platform usage?

Metrics:

Total hours watched
Unique viewers
Completion rate
Episodes watched per user
Repeat viewing
Recommendation impact

4. Brand value

Some content makes the platform feel premium.

It may not have the highest watch hours, but it may improve brand perception.

Examples:

Award-winning content
Prestige shows
Culturally important titles
Strong niche content

5. Portfolio value

Content may fill a gap in the catalog.

For example:

Kids content
Regional content
Anime
Sports documentaries
Local language content

A title may be valuable because it serves a specific audience very well.

6. Long-term library value

Some content keeps getting watched for years.

Metrics:

Evergreen watch time
Long-tail engagement
Rewatch rate
Search demand
Recommendation performance

Simple valuation formula

Content Value =
Incremental Acquisition Value
+ Incremental Retention Value
+ Incremental Engagement Value
+ Brand Value
+ Long-Term Library Value
- Content Cost
- Marketing Cost

I would be careful not to give all credit to one title.

A user may join after seeing an ad for one show but stay because of the full catalog.

So attribution should be handled carefully.

⑯ What are the value drivers for Netflix?

I would break Netflix value drivers into customer, content, monetization, engagement, cost, and strategic drivers.

Customer drivers

Subscriber growth
Retention
Churn reduction
User engagement
Household penetration
International growth
Paid sharing conversion

Netflix becomes more valuable when it can attract and retain users profitably.

Content drivers

Quality of original content
Depth of content library
Local language content
Exclusive rights
Franchise potential
Content freshness
Hit rate of new releases

The content engine matters because users stay when they believe there is always something worth watching.

Monetization drivers

Subscription pricing
Plan mix
Ad-supported plan growth
Revenue per user
Upsell opportunities
Regional pricing strategy

A company can grow not only by adding users, but also by increasing revenue per user.

Engagement drivers

Watch time
Completion rate
Search success
Recommendation quality
App experience
Content discovery

Engagement matters because high engagement usually supports retention.

Cost drivers

Content production cost
Licensing cost
Marketing efficiency
Technology infrastructure
Customer support cost

A platform can grow revenue and still struggle if content costs rise too quickly.

Strategic drivers

Global distribution
Brand strength
Data advantage
Personalization
Partnerships
Live events or special programming
Gaming or newer entertainment formats

In an interview, I would say:

“Netflix value is driven by its ability to acquire users, keep them engaged, reduce churn, monetize through pricing and ads, and manage content costs while continuing to produce shows people care about.”

⑰ What would you consider when valuing a Netflix deal?

First, I would clarify what kind of deal it is.

Is it:

Licensing a show?
Producing an original series?
Buying exclusive streaming rights?
Partnering with a studio?
Sports or live event rights?
Talent deal?
Regional content deal?

Then I would evaluate both value and risk.

Revenue impact

Will this deal bring new subscribers?
Will it reduce churn?
Will it increase engagement?
Will it support ad revenue?
Will it help pricing power?

Audience fit

Which audience does it serve?
Is the audience large enough?
Is it global or regional?
Does it attract a hard-to-reach segment?
Does it strengthen a weak catalog area?

Content performance

I would estimate:

Expected viewers
Completion rate
Watch hours
Rewatch potential
Social buzz
Search demand
Similar title performance

Cost

I would include:

Licensing fee
Production cost
Marketing cost
Localization cost
Legal and rights cost
Opportunity cost

Exclusivity

Exclusive content is usually more valuable than non-exclusive content.

But exclusivity costs more, so I would ask whether exclusivity is worth the premium.

Time value

Some deals create short-term spikes.

Others build long-term library value.

A sports event may drive immediate engagement, while a strong series may create long-tail value for years.

Risk

Risks include:

Content underperformance
Production delay
Audience mismatch
Regional rights complexity
Reputation risk
Cost overruns
Weak retention impact

Decision

I would compare expected incremental value against total cost.

Deal Value =
Incremental Subscriber Value
+ Retention Value
+ Engagement Value
+ Ad Revenue Value
+ Brand Value
+ Long-Term Library Value
- Total Deal Cost

I would recommend the deal only if the expected value is higher than the cost and the strategic fit is strong.

⑱ Design a full experiment end to end and explain your choices.

Let’s design an experiment for a streaming platform.

Product idea

We want to test a new personalized homepage ranking model.

The new model is expected to help users find something to watch faster.

Step 1: Define the goal

The goal is to improve content discovery.

Business goal:

Increase user engagement
Improve retention
Reduce browsing frustration

Step 2: Define the hypothesis

Hypothesis:

“If we improve homepage ranking, users will start watching content faster and watch more content per session.”

Step 3: Choose the unit of randomization

I would randomize at the user level.

Each user sees either:

Control: Current homepage ranking
Treatment: New ranking model

User-level randomization works because homepage experience is personal.

Step 4: Define metrics

Primary metric:

Play start rate per session

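This tells us whether users are more likely to find something to watch. Because randomization is at the user level, sessions from the same user are correlated, so I would compute standard errors at the user level.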

Secondary metrics:

Watch time per user
Time to first play
Homepage click-through rate
Completion rate
Search usage
Return rate next day or next week

Guardrail metrics:

App crashes
Playback errors
Churn
Customer complaints
Diversity of content watched
Latency of homepage loading

I would not rely on only one engagement metric because a ranking model can increase clicks but hurt satisfaction.

Step 5: Sample size and duration

Before launching, I would estimate sample size using:

Baseline play start rate
Minimum detectable effect
Power, usually 80 percent or 90 percent
Significance level, often 0.05
Expected variance
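These inputs can be combined into a rough sample size estimate. The sketch below uses the standard normal-approximation formula for comparing two proportions; the 40 percent baseline and the 1 percentage point minimum detectable effect are hypothetical, and the function name is illustrative. The formula treats observations as independent, so with a per-session metric and user-level randomization it is better read as a starting point than a final answer.

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline_rate, min_detectable_effect, alpha=0.05, power=0.80):
    """Approximate units per group for a two-sided test on two proportions."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect  # absolute lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2)

# Hypothetical inputs: 40% baseline play start rate, 1 point absolute lift.
n_80 = sample_size_per_group(0.40, 0.01, power=0.80)
n_90 = sample_size_per_group(0.40, 0.01, power=0.90)

print(f"Per group at 80% power: {n_80:,}")
print(f"Per group at 90% power: {n_90:,}")
```

Halving the minimum detectable effect roughly quadruples the required sample, which is usually the most useful fact to mention when discussing test duration.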

I would run the test long enough to cover normal user behavior.

For entertainment products, I would usually want at least one full weekly cycle because weekday and weekend behavior can be very different.

Step 6: Data quality checks

Before trusting results, I would check:

Sample ratio mismatch
Event logging
Exposure logging
Missing data
Duplicate users
Bot activity
Whether treatment users actually saw the new ranking
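The first check, sample ratio mismatch, can be sketched as a chi-square goodness-of-fit test on the group counts. The counts below are made up, a 50/50 split is assumed, and the function name is hypothetical.

```python
import math

def srm_p_value(control_users, treatment_users, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the split."""
    total = control_users + treatment_users
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi_square = (
        (control_users - expected_control) ** 2 / expected_control
        + (treatment_users - expected_treatment) ** 2 / expected_treatment
    )
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi_square / 2))

p_balanced = srm_p_value(49_900, 50_100)
p_skewed = srm_p_value(49_500, 50_500)

print(f"49,900 vs 50,100: p = {p_balanced:.3f}")
print(f"49,500 vs 50,500: p = {p_skewed:.4f}")
```

A very small p-value here suggests the assignment or logging is broken, and the experiment results should not be trusted until the cause is found. Teams often use a stricter threshold than 0.05 for this check because it runs on every experiment.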

Step 7: Launch plan

I would start with a small ramp:

1 percent traffic
Check guardrails
Move to 10 percent
Then 50 percent if stable

This reduces risk.

Step 8: Analyze results

At the end, I would compare treatment vs control.

I would look at:

Difference in primary metric
Confidence interval
p-value
Effect size
Guardrails
Segment-level performance

Important segments:

New users
Returning users
Heavy users
Light users
Mobile users
TV users
Different countries

Step 9: Decision

I would launch if:

Primary metric improves
Guardrails are healthy
Effect is practically meaningful
No major segment is harmed
Technical performance is stable

I would not launch if:

The result is statistically significant but too small to matter
Watch time improves but complaints increase
Clicks improve but completion drops
One major user segment is harmed

Step 10: Follow-up

After launch, I would keep monitoring:

Long-term retention
Content diversity
User complaints
Model drift
Recommendation freshness

A/B testing does not end the moment we launch. Real users keep changing, and the system needs monitoring.

Final Interview Tip

For messy analytics questions, do not rush into formulas.

A strong answer usually follows this structure:

Clarify the business goal
Define the metric
State assumptions
Break the problem into parts
Choose the right method
Mention risks and edge cases
Tie the answer back to business impact

That is what interviewers are really looking for.

Not memorized answers.

Clear thinking.

35 Series A Startups Hiring in 2026

Karthik Adari — Mon, 27 Apr 2026 00:28:18 GMT

If you are job hunting in 2026, one of the smartest places to look is recently funded startups.

Why?

Because after a Series A round, many startups start expanding their engineering, product, sales, operations, data, customer success, and GTM teams. These companies may not always be as crowded as big tech, but they can offer strong learning, faster growth, and early-career opportunities.

Below are 35 USA-based startups from the Q1 2026 Series A list that showed strong active-hiring signals.

Here is the list.

1. depthfirst

Domain: Cybersecurity / AI Security
Funds raised: $40M Series A
Venture funded: Accel, Alt Capital, BoxGroup, Liquid 2 Ventures, SV Angel
Company Link: https://depthfirst.com
Company career site: https://depthfirst.com/careers
Small Blurb: depthfirst is building AI-native security tools for code, infrastructure, and business logic vulnerabilities. A strong company to watch for applied AI, security engineering, product, and GTM roles.

2. Renterra

Domain: Construction Tech / Equipment Rental SaaS
Funds raised: $9M Series A
Venture funded: Avenue Growth Partners
Company Link: https://getrenterra.com
Company career site: https://getrenterra.com/careers
Small Blurb: Renterra builds rental management software for heavy equipment companies. Good fit for candidates interested in SaaS, operations, customer success, product, and engineering roles.

3. Neurophos

Domain: Semiconductors / Photonic AI Chips
Funds raised: $110M Series A
Venture funded: Gates Frontier, M12, Carbon Direct Capital, Aramco Ventures, Bosch Ventures
Company Link: https://www.neurophos.com
Company career site: https://www.neurophos.com/careers
Small Blurb: Neurophos is working on photonic AI inference chips. This is a strong company to follow for hardware, AI infrastructure, chip design, systems, and research roles.

4. Artie

Domain: Data Infrastructure / Real-Time Streaming
Funds raised: $12M Series A
Venture funded: Standard Capital, Y Combinator, Pathlight Ventures
Company Link: https://www.artie.com
Company career site: https://www.artie.com/careers
Small Blurb: Artie builds real-time data streaming infrastructure for fraud, inventory, analytics, and AI workloads. Great fit for data engineering, backend, infrastructure, and developer tools candidates.

5. Linq

Domain: AI Communication Infrastructure
Funds raised: $20M Series A
Venture funded: TQ Ventures, Mucker Capital, angel investors
Company Link: https://www.linq.ai
Company career site: https://www.linq.ai/careers
Small Blurb: Linq is building a communication layer for AI agents across SMS, iMessage, RCS, and voice. Good company to track for AI, backend, product, and communication platform roles.

6. Fundamental

Domain: Enterprise AI / Tabular AI
Funds raised: $255M Series A
Venture funded: Oak HC/FT, Valor Equity Partners, Battery Ventures, Salesforce Ventures
Company Link: https://www.fundamental.ai
Company career site: https://www.fundamental.ai/careers
Small Blurb: Fundamental builds Large Tabular Models for enterprise decision-making. Strong fit for ML engineers, data scientists, research engineers, enterprise AI, and GTM roles.

7. Rowspace

Domain: FinTech / Institutional AI
Funds raised: $50M Seed + Series A
Venture funded: Sequoia, Emergence Capital, Stripe, Conviction, Basis Set
Company Link: https://www.rowspace.com
Company career site: https://www.rowspace.com/careers
Small Blurb: Rowspace is building AI tools for institutional finance and portfolio decision-making. A strong startup to watch for fintech, AI, data, and backend roles.

8. Corridor

Domain: Cybersecurity / AI Coding Security
Funds raised: $25M Series A
Venture funded: Felicis, Conviction, Timeless, Lux Capital, Datadog, SV Angel
Company Link: https://www.corridor.dev
Company career site: https://www.corridor.dev/jobs
Small Blurb: Corridor focuses on security for AI-native software development. Good fit for candidates interested in AppSec, AI security, developer tools, and infrastructure.

9. Gimlet Labs

Domain: AI Infrastructure / Serverless Inference
Funds raised: $80M Series A
Venture funded: Menlo Ventures, Eclipse Ventures, Prosperity7, Triatomic, Factory
Company Link: https://www.gimletlabs.ai
Company career site: https://www.gimletlabs.ai/careers
Small Blurb: Gimlet Labs is building serverless inference infrastructure for AI agents and multi-agent systems. Strong fit for systems, distributed computing, ML infrastructure, and backend roles.

10. Cloudforce

Domain: AI / Healthcare / Public Sector
Funds raised: $10M Series A
Venture funded: Owl Ventures, M12
Company Link: https://www.gocloudforce.com
Company career site: https://www.gocloudforce.com/careers
Small Blurb: Cloudforce builds AI solutions for regulated sectors like healthcare and the public sector. Good startup to watch for AI, cloud, compliance, and implementation roles.

11. Converge Bio

Domain: AI Drug Discovery / Biotech
Funds raised: $25M Series A
Venture funded: Bessemer Venture Partners, TLV Partners, Vintage Investment Partners, Saras Capital
Company Link: https://converge-bio.com
Company career site: https://converge-bio.com/careers
Small Blurb: Converge Bio uses AI to support drug discovery and development. Strong fit for computational biology, bioinformatics, ML, and data science candidates.

12. SkyFi

Domain: Earth Intelligence / Geospatial AI
Funds raised: $12.7M Series A
Venture funded: Buoyant Ventures, IronGate Capital Advisors, DNV Ventures, TFX Ventures, J2 Ventures
Company Link: https://skyfi.com
Company career site: https://skyfi.com/careers
Small Blurb: SkyFi is an Earth intelligence platform built around satellite imagery and geospatial analytics. Good fit for GIS, computer vision, data science, and defense-tech roles.

13. Zarminali Pediatrics

Domain: Healthcare / Pediatrics
Funds raised: $110M Series A
Venture funded: General Catalyst, Healthier Capital, K2 HealthVentures
Company Link: https://zarminali.com
Company career site: https://zarminali.com/careers
Small Blurb: Zarminali Pediatrics is building a tech-focused pediatric care group. Strong company to watch for healthcare operations, product, data, clinical, and support roles.

14. Cubby

Domain: PropTech / Storage Management SaaS
Funds raised: $63M Series A
Venture funded: Goldman Sachs Alternatives Growth Equity
Company Link: https://www.cubbystorage.com
Company career site: https://www.cubbystorage.com/careers
Small Blurb: Cubby builds AI-native property management software for self-storage operators. Good fit for SaaS, operations, engineering, support, and product roles.

15. Cambio

Domain: Commercial Real Estate AI / Climate Tech
Funds raised: $18M Series A
Venture funded: Maverick Ventures, Y Combinator, Adverb Ventures, Peterson Ventures
Company Link: https://cambio.ai
Company career site: https://cambio.ai/careers
Small Blurb: Cambio helps commercial real estate teams improve building performance and retrofit planning. Strong fit for climate tech, data, product, and real estate operations roles.

16. Mia Labs

Domain: Automotive AI / Voice AI
Funds raised: $20M Series A
Venture funded: Permanent Capital Ventures, Norwest, Eniac Ventures, Vine Ventures
Company Link: https://www.mia.inc
Company career site: https://www.mia.inc/careers
Small Blurb: Mia Labs builds conversational AI tools for automotive dealerships. Good fit for AI, voice systems, customer success, sales engineering, and backend roles.

17. Tradespace

Domain: LegalTech / IP Management AI
Funds raised: $15M Series A
Venture funded: AVP, Eniac Ventures, Amplo VC, Scrum Ventures
Company Link: https://tradespace.io
Company career site: https://tradespace.io/careers
Small Blurb: Tradespace builds AI-native tools for invention disclosure, patents, and IP workflows. Good fit for legaltech, AI, product, data, and enterprise SaaS roles.

18. Concourse

Domain: FinTech / Finance AI Agents
Funds raised: $12M Series A
Venture funded: Standard Capital, Andreessen Horowitz, CRV, Y Combinator
Company Link: https://www.concourse.co
Company career site: https://www.concourse.co/careers
Small Blurb: Concourse builds AI agents for corporate finance teams. Strong fit for candidates interested in finance automation, AI agents, backend, and product roles.

19. Datatruck

Domain: Logistics SaaS / Trucking AI
Funds raised: $12M Series A
Venture funded: Avenue Growth Partners
Company Link: https://www.datatruck.io
Company career site: https://www.datatruck.io/careers
Small Blurb: Datatruck builds an AI-native operating system for trucking companies. Good fit for logistics, SaaS, operations, product, data, and customer success roles.

20. Checkbox

Domain: LegalTech / AI Agents
Funds raised: $23M Series A
Venture funded: Touring Capital, Peak XV, Conductive Ventures, Tidal Ventures
Company Link: https://www.checkbox.ai
Company career site: https://www.checkbox.ai/careers
Small Blurb: Checkbox builds AI agent solutions for in-house legal teams. Strong fit for candidates interested in legal automation, workflow tools, product, and enterprise AI.

21. XBuild

Domain: Construction AI / Estimating
Funds raised: $19M Series A
Venture funded: N47, Rackhouse Ventures, Andreessen Horowitz
Company Link: https://www.xbuild.ai
Company career site: https://www.xbuild.ai/careers
Small Blurb: XBuild uses AI to support construction estimating and proposal generation. Good fit for AI, SaaS, construction tech, product, and GTM roles.

22. Resolve AI

Domain: SRE / Engineering AI Agents
Funds raised: $125M Series A
Venture funded: Publicly reported Series A investors
Company Link: https://resolve.ai
Company career site: https://resolve.ai/careers
Small Blurb: Resolve AI helps engineering teams automate incident response and reliability workflows. Strong fit for SRE, DevOps, backend, AI agents, and platform roles.

23. Didero

Domain: Procurement AI / Enterprise SaaS
Funds raised: $30M Series A
Venture funded: Chemistry, Headline, M12
Company Link: https://www.didero.ai
Company career site: https://www.didero.ai/careers
Small Blurb: Didero builds AI agents for procurement teams, manufacturers, and distributors. Good fit for enterprise AI, product, supply chain, and SaaS roles.

24. Take2

Domain: Healthcare Recruiting AI / HRTech
Funds raised: $14M Series A
Venture funded: Human Capital, Bertelsmann Healthcare Investments, Reach Capital
Company Link: https://www.take2.ai
Company career site: https://www.take2.ai/careers
Small Blurb: Take2 builds AI agents for healthcare recruiting, credentialing, scheduling, and onboarding. Good fit for HRTech, healthcare operations, AI, and customer success roles.

25. Integrate

Domain: DefenseTech / Project Management SaaS
Funds raised: $17M Series A
Venture funded: FPV Ventures, Fuse VC, Rsquared VC
Company Link: https://www.integrate.co
Company career site: https://www.integrate.co/careers
Small Blurb: Integrate builds project management software for defense, space, cyber, maritime, and aerospace programs. Strong fit for defense-tech, product, engineering, and program operations roles.

26. Zero Homes

Domain: ClimateTech / Home Electrification
Funds raised: $16.8M Series A
Venture funded: Prelude Ventures, SJF Ventures, Watsco Ventures, VoLo Earth Ventures
Company Link: https://www.zerohomes.io
Company career site: https://www.zerohomes.io/careers
Small Blurb: Zero Homes helps homeowners electrify through heat pumps, insulation, EV chargers, and related projects. Good fit for climate tech, operations, data, and product roles.

27. Humand

Domain: HRTech / Deskless Workforce AI
Funds raised: $66M Series A
Venture funded: Kaszek, Goodwater Capital, Y Combinator, angel investors
Company Link: https://humand.co
Company career site: https://humand.co/careers
Small Blurb: Humand builds an AI-powered operating system for deskless workforces. Strong fit for HRTech, SaaS, product, implementation, and customer success roles.

28. Coral Care

Domain: Pediatric Healthcare / Therapy Marketplace
Funds raised: $13M Series A
Venture funded: Haymaker Ventures, FCA Ventures, Peterson Ventures, AlleyCorp, Reach Capital
Company Link: https://www.joincoralcare.com
Company career site: https://www.joincoralcare.com/careers
Small Blurb: Coral Care expands access to in-home pediatric speech, occupational, and physical therapy. Good fit for healthcare operations, product, support, and marketplace roles.

29. Third Way Health

Domain: Healthcare Services / Automation
Funds raised: $15M Series A
Venture funded: Health Velocity Capital
Company Link: https://www.thirdway.health
Company career site: https://www.thirdway.health/careers
Small Blurb: Third Way Health supports healthcare organizations with front-office services like scheduling and prior authorization. Good fit for healthcare operations, automation, and customer success roles.

30. Halcyon

Domain: Energy AI / Data Infrastructure
Funds raised: $21M Series A
Venture funded: Energize Capital, Zero Infinity Partners, Congruent Ventures, Obvious Ventures
Company Link: https://www.halcyon.eco
Company career site: https://www.halcyon.eco/careers
Small Blurb: Halcyon builds AI tools for energy professionals, power-market intelligence, and data center siting. Strong fit for energy, data, AI, climate, and infrastructure roles.

31. Conduit Health

Domain: Healthcare / Medicare & Medicaid Services
Funds raised: $17M Series A
Venture funded: Drive Capital, XYZ Ventures, Twelve Below, Eniac Ventures
Company Link: https://www.conduithealth.com
Company career site: https://www.conduithealth.com/careers
Small Blurb: Conduit Health provides insurance-covered medical supplies and services for Medicare and Medicaid patients. Good fit for healthcare operations, support, data, and growth roles.

32. Deeptune

Domain: AI Simulation / Agent Training
Funds raised: $43M Series A
Venture funded: Andreessen Horowitz, 776, Abstract Ventures, Inspired Capital
Company Link: https://www.deeptune.ai
Company career site: https://www.deeptune.ai/careers
Small Blurb: Deeptune builds simulation environments where AI agents can practice complex tasks. Strong fit for AI research, simulation, ML engineering, and infrastructure roles.

33. Edra

Domain: Workflow Automation / AI Agents
Funds raised: $30M Series A
Venture funded: Sequoia Capital, A*, 8VC
Company Link: https://edra.com
Company career site: https://edra.com/careers
Small Blurb: Edra builds AI agents that learn business operations and automate repetitive workflows. Good fit for AI agents, workflow automation, backend, and product roles.

34. BlueFlag Security

Domain: Cybersecurity / Developer Identity Governance
Funds raised: $16.5M Series A
Venture funded: Maverick Ventures, Ten Eleven Ventures
Company Link: https://www.blueflagsecurity.com
Company career site: https://www.blueflagsecurity.com/careers
Small Blurb: BlueFlag Security focuses on identity-centric security for developers, contractors, non-human identities, and AI agents. Strong fit for cybersecurity, identity, DevSecOps, and platform roles.

35. Starcloud

Domain: SpaceTech / Data Centers in Space
Funds raised: $170M Series A
Venture funded: Publicly reported Series A investors
Company Link: https://www.starcloud.com
Company career site: https://www.starcloud.com/careers
Small Blurb: Starcloud is building data centers in space. Worth following for aerospace, distributed systems, AI infrastructure, thermal engineering, and hardware roles.

Final note for job seekers

Recently funded startups are not always easy to discover through regular job boards.

That is exactly why they are worth tracking.

Do not apply only to the same 10 big companies everyone else is applying to. Build a startup watchlist, check their career pages every week, connect with team members, and apply early when roles open.

These 35 companies are a good starting point if you are exploring roles in AI, cybersecurity, healthcare, climate tech, fintech, defense tech, data infrastructure, and space tech.

50 Data Center Projects for MS BA, MS DS, and Tech Students

Karthik Adari — Sat, 25 Apr 2026 23:07:38 GMT

Here is the important part.

These projects are not only for computer science students. Many MS Business Analytics, MS Data Science, MS Information Technology, and Cybersecurity students can also work on these because modern data centers generate a lot of data: server metrics, logs, power usage, ticket trends, uptime, capacity usage, network traffic, cloud cost, and security alerts.

Before using the resume points below, please remember:

Resume bullet points are only for reference. Adjust them based on what you actually build, measure, and customize for your own profile.

Resume for Reference - https://www.overleaf.com/read/szjhqgsvcdyw#e6311a

Project 1. NetBox DCIM/IPAM Lab

Github link - https://github.com/netbox-community/netbox

Best fit MS Backgrounds - MS IT, MS CS, MS BA, MS DS

Resume-Ready Bullet Points -

Built a data center inventory system using NetBox to track rack utilization, device inventory, IP addresses, and circuits.
Created structured infrastructure datasets for asset reporting, capacity planning, and operational analysis.
Automated device and IP documentation using NetBox APIs to reduce manual tracking effort by X%.

Project 2. openDCIM Data Center Inventory

Github link - https://github.com/opendcim/openDCIM

Best fit MS Backgrounds - MS IT, MS BA, MS DS

Resume-Ready Bullet Points -

Implemented openDCIM to manage rack-level assets, cabinet usage, power allocation, and device inventory.
Built utilization reports to identify unused rack capacity and infrastructure gaps across X cabinets.
Created a data center asset-tracking workflow to improve visibility into hardware location, ownership, and capacity usage.

Project 3. Nautobot Network Source of Truth

Github link - https://github.com/nautobot/nautobot

Best fit MS Backgrounds - MS CS, MS IT, MS DS

Resume-Ready Bullet Points -

Built a network source-of-truth system using Nautobot to manage devices, sites, IP addresses, and circuits.
Created API-based workflows to validate network inventory accuracy across X network assets.
Designed reports and dashboards to improve visibility into network documentation quality and infrastructure readiness.

Project 4. Data Center Asset Analytics Dashboard

Github link - https://github.com/netbox-community/netbox

Best fit MS Backgrounds - MS BA, MS DS, MS IT

Resume-Ready Bullet Points -

Analyzed NetBox asset data to identify rack utilization, device density, and capacity trends.
Built a dashboard summarizing infrastructure usage by rack, site, device type, and IP allocation.
Generated capacity planning insights to support decisions around hardware expansion, rack space, and network resources.

Project 5. IP Address Management Analytics

Github link - https://github.com/SpriteLink/NIPAP

Best fit MS Backgrounds - MS BA, MS DS, MS IT

Resume-Ready Bullet Points -

Built an IP address utilization dashboard to track subnet usage, available IPs, and allocation efficiency.
Analyzed IP allocation patterns to reduce unused blocks and improve network planning.
Created IP capacity reports showing X% subnet utilization and future availability risk.
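
As a rough illustration of the subnet utilization idea above, here is a minimal Python sketch. It does not call NIPAP; it uses the standard-library ipaddress module, and the subnets, allocated addresses, and 80% risk threshold are all made up for the example.

```python
import ipaddress

def subnet_utilization(cidr, allocated_ips):
    """Return (used, usable, utilization) for one IPv4 subnet.

    Assumes a prefix of /30 or shorter, so there are at least two usable hosts.
    """
    network = ipaddress.ip_network(cidr)
    # Usable hosts exclude the network and broadcast addresses.
    usable = network.num_addresses - 2
    used = sum(1 for ip in allocated_ips if ipaddress.ip_address(ip) in network)
    return used, usable, used / usable

# Hypothetical allocations (illustrative only).
allocations = {
    "10.0.1.0/24": [f"10.0.1.{i}" for i in range(1, 128)],
    "10.0.2.0/28": [f"10.0.2.{i}" for i in range(1, 14)],
}

RISK_THRESHOLD = 0.80  # assumed cutoff for "running out of space"

for cidr, ips in allocations.items():
    used, usable, utilization = subnet_utilization(cidr, ips)
    flag = "at risk" if utilization >= RISK_THRESHOLD else "ok"
    print(f"{cidr}: {used}/{usable} used ({utilization:.0%}) -> {flag}")
```

In a real project the allocation list would come from your IPAM export, and the threshold would depend on how quickly each subnet fills up.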

Project 6. MAAS Bare-Metal Provisioning

Github link - https://github.com/canonical/maas

Best fit MS Backgrounds - MS CS, MS IT, MS DS

Resume-Ready Bullet Points -

Deployed MAAS to automate bare-metal server discovery, PXE boot, and OS provisioning.
Configured repeatable server deployment workflows across X nodes.
Tracked provisioning time and documented improvements in server onboarding speed by X%.

Project 7. Foreman Server Lifecycle Management

Github link - https://github.com/theforeman/foreman

Best fit MS Backgrounds - MS IT, MS CS, MS BA

Resume-Ready Bullet Points -

Built a server lifecycle management workflow using Foreman for provisioning, patch tracking, and host inventory.
Created infrastructure reports for patch compliance, host status, and operational visibility.
Improved server documentation by centralizing records for X systems.

Project 8. Cobbler PXE Deployment Lab

Github link - https://github.com/cobbler/cobbler

Best fit MS Backgrounds - MS IT, MS CS

Resume-Ready Bullet Points -

Built a PXE-based Linux provisioning lab using Cobbler for automated OS installation.
Configured DHCP, DNS, Kickstart, and boot profiles for repeatable server deployments.
Reduced manual installation steps by automating server setup across X Linux machines.

Project 9. FOG Imaging Project

Github link - https://github.com/FOGProject/fogproject

Best fit MS Backgrounds - MS IT, MS CS, MS BA

Resume-Ready Bullet Points -

Implemented an imaging workflow to deploy and restore systems using FOG Project.
Created device inventory and imaging status reports for X endpoints/servers.
Improved system recovery readiness by standardizing image deployment and backup workflows.

Project 10. Ansible Data Center Automation

Github link - https://github.com/ansible/ansible

Best fit MS Backgrounds - MS IT, MS CS, MS DS

Resume-Ready Bullet Points -

Automated Linux server configuration using Ansible playbooks for users, packages, services, and security settings.
Built reusable roles for monitoring agent installation and baseline server hardening.
Reduced repetitive administration tasks by automating X infrastructure workflows.

Project 11. Terraform Infrastructure as Code

Github link - https://github.com/hashicorp/terraform

Best fit MS Backgrounds - MS CS, MS IT, MS BA

Resume-Ready Bullet Points -

Built reusable Terraform modules to provision infrastructure resources consistently across environments.
Managed infrastructure configuration using variables, state files, and modular design.
Improved deployment repeatability by defining X infrastructure resources as code.

Project 12. OpenTofu IaC Lab

Github link - https://github.com/opentofu/opentofu

Best fit MS Backgrounds - MS CS, MS IT, MS BA

Resume-Ready Bullet Points -

Created OpenTofu modules to automate infrastructure deployment and environment setup.
Designed reusable templates for development, testing, and production-like infrastructure.
Practiced version-controlled infrastructure management across X environments.

Project 13. Packer Golden Image Builder

Github link - https://github.com/hashicorp/packer

Best fit MS Backgrounds - MS IT, MS CS

Resume-Ready Bullet Points -

Built repeatable Linux golden images using Packer for faster server provisioning.
Automated baseline package installation, security configuration, and image validation.
Reduced manual server setup time by creating reusable VM images with X standard configurations.

Project 14. Kubernetes Cluster Operations

Github link - https://github.com/kubernetes/kubernetes

Best fit MS Backgrounds - MS CS, MS IT, MS DS

Resume-Ready Bullet Points -

Deployed and managed a Kubernetes cluster to understand workloads, services, scheduling, and scaling.
Monitored pod health, resource usage, and cluster availability using operational metrics.
Documented troubleshooting steps for failed deployments, scaling issues, and service downtime.

Project 15. Kubespray Bare-Metal Kubernetes

Github link - https://github.com/kubernetes-sigs/kubespray

Best fit MS Backgrounds - MS CS, MS IT, MS DS

Resume-Ready Bullet Points -

Built a multi-node Kubernetes cluster using Kubespray and Ansible automation.
Configured networking, storage, node roles, and cluster validation workflows.
Deployed Kubernetes on X nodes and documented high-availability setup steps.

Project 16. K3s Edge/Data Center Mini Cluster

Github link - https://github.com/k3s-io/k3s

Best fit MS Backgrounds - MS CS, MS IT, MS DS

Resume-Ready Bullet Points -

Built a lightweight Kubernetes cluster using K3s to simulate edge and small data center environments.
Deployed containerized workloads and monitored CPU, memory, disk, and network usage.
Created a compact infrastructure lab for testing automation, monitoring, and workload deployment.

Project 17. MetalLB Bare-Metal Load Balancing

Github link - https://github.com/metallb/metallb

Best fit MS Backgrounds - MS CS, MS IT, MS DS

Resume-Ready Bullet Points -

Configured MetalLB to provide load balancing for a bare-metal Kubernetes cluster.
Tested Layer 2 and BGP-based service exposure for internal applications.
Improved service availability by enabling load-balanced access across X Kubernetes services.

Project 18. KubeVirt VM on Kubernetes

Github link - https://github.com/kubevirt/kubevirt

Best fit MS Backgrounds - MS CS, MS IT

Resume-Ready Bullet Points -

Deployed virtual machines inside Kubernetes using KubeVirt.
Compared VM-based and container-based workloads across resource usage, deployment speed, and management complexity.
Documented how traditional virtualization and Kubernetes can operate in the same infrastructure environment.

Project 19. Harvester Hyperconverged Infrastructure

Github link - https://github.com/harvester/harvester

Best fit MS Backgrounds - MS IT, MS CS, MS BA

Resume-Ready Bullet Points -

Built a hyperconverged infrastructure lab using Harvester for VMs, storage, and cluster resource management.
Created infrastructure capacity notes covering compute, memory, storage, and workload usage.
Managed virtual workloads through a unified private cloud-style platform.

Project 20. OpenStack-Ansible Private Cloud

Github link - https://github.com/openstack/openstack-ansible

Best fit MS Backgrounds - MS CS, MS IT, MS BA

Resume-Ready Bullet Points -

Deployed a private cloud environment using OpenStack-Ansible.
Configured compute, networking, and storage services for VM provisioning.
Documented private cloud operations including instance creation, resource allocation, and service validation.

Project 21. Apache CloudStack Private Cloud

Github link - https://github.com/apache/cloudstack

Best fit MS Backgrounds - MS CS, MS IT, MS BA

Resume-Ready Bullet Points -

Built a private cloud lab using Apache CloudStack to manage virtual infrastructure.
Monitored compute pools, VM usage, network resources, and storage allocation.
Created infrastructure reports showing VM capacity, usage trends, and operational status.

Project 22. OpenNebula Private/Edge Cloud

Github link - https://github.com/OpenNebula/one

Best fit MS Backgrounds - MS CS, MS IT, MS BA

Resume-Ready Bullet Points -

Deployed OpenNebula to manage virtualized infrastructure resources.
Created VM templates and automated workload deployment across X virtual machines.
Analyzed resource allocation across compute, storage, and network pools.

Project 23. Prometheus Metrics Monitoring

Github link - https://github.com/prometheus/prometheus

Best fit MS Backgrounds - MS DS, MS BA, MS IT, MS CS

Resume-Ready Bullet Points -

Built a metrics monitoring system using Prometheus for servers, applications, and infrastructure services.
Wrote PromQL queries to analyze CPU usage, memory consumption, disk I/O, and network traffic.
Configured alerts for infrastructure health thresholds, reducing manual monitoring effort by X%.

Project 24. Grafana Infrastructure Dashboard

Github link - https://github.com/grafana/grafana

Best fit MS Backgrounds - MS BA, MS DS, MS IT

Resume-Ready Bullet Points -

Designed Grafana dashboards for server health, network traffic, uptime, and capacity usage.
Converted raw infrastructure metrics into visual reports for operational and business decision-making.
Created alert panels to track SLA performance, downtime risk, and resource saturation.

Project 25. Grafana Loki Log Analytics

Github link - https://github.com/grafana/loki

Best fit MS Backgrounds - MS DS, MS BA, MS IT, MS Cybersecurity

Resume-Ready Bullet Points -

Built a centralized log analytics system using Loki and Grafana.
Created queries to identify repeated errors, service failures, latency spikes, and incident patterns.
Developed dashboards for operational troubleshooting and reduced log investigation time by X%.

Project 26. OpenTelemetry Collector Pipeline

Github link - https://github.com/open-telemetry/opentelemetry-collector

Best fit MS Backgrounds - MS DS, MS CS, MS IT

Resume-Ready Bullet Points -

Built a telemetry collection pipeline using OpenTelemetry Collector.
Routed metrics, logs, and traces from services into observability tools.
Documented end-to-end data flow architecture for infrastructure monitoring and incident analysis.

Project 27. Netdata Real-Time Server Monitoring

Github link - https://github.com/netdata/netdata

Best fit MS Backgrounds - MS BA, MS DS, MS IT

Resume-Ready Bullet Points -

Deployed Netdata to monitor CPU, memory, disk, network, and service health in real time.
Created operational dashboards to visualize server performance and infrastructure behavior.
Analyzed resource spikes during workload testing and identified X performance bottlenecks.

Project 28. Zabbix Enterprise Monitoring

Github link - https://github.com/zabbix/zabbix

Best fit MS Backgrounds - MS IT, MS BA, MS DS

Resume-Ready Bullet Points -

Implemented Zabbix monitoring for servers, services, and network devices.
Configured alerts for downtime, resource saturation, disk usage, and service failures.
Created infrastructure availability reports showing uptime, incident frequency, and response patterns.

Project 29. LibreNMS Network Monitoring

Github link - https://github.com/librenms/librenms

Best fit MS Backgrounds - MS IT, MS BA, MS DS

Resume-Ready Bullet Points -

Deployed LibreNMS to monitor routers, switches, and server interfaces using SNMP.
Built dashboards for bandwidth usage, device health, interface status, and network utilization.
Created alerting rules for interface downtime and unusual traffic patterns across X devices.

Project 30. Icinga2 Availability Monitoring

Github link - https://github.com/Icinga/icinga2

Best fit MS Backgrounds - MS IT, MS BA

Resume-Ready Bullet Points -

Configured Icinga2 checks for server uptime, services, and infrastructure availability.
Designed alert rules for failed services, degraded performance, and availability drops.
Documented incident response workflows for service outages and monitoring escalations.

Project 31. OpenSearch Log Search & Analytics

Github link - https://github.com/opensearch-project/OpenSearch

Best fit MS Backgrounds - MS DS, MS BA, MS Cybersecurity, MS IT

Resume-Ready Bullet Points -

Built an OpenSearch-based log analytics system for infrastructure and security events.
Created searchable indexes for server logs, application logs, and security alerts.
Designed dashboards to analyze incident trends, error frequency, and operational patterns.

Project 32. OpenSearch Dashboards BI Project

Github link - https://github.com/opensearch-project/OpenSearch-Dashboards

Best fit MS Backgrounds - MS BA, MS DS, MS IT

Resume-Ready Bullet Points -

Created OpenSearch dashboards for infrastructure KPIs, incident counts, and log trends.
Built visual reports to support operational decision-making and reliability reviews.
Connected log data to business-style metrics such as SLA performance, MTTR, and downtime frequency.
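
To show what metrics like MTTR and downtime look like in practice, here is a small Python sketch. The incident timestamps are invented, and it assumes every incident means full downtime and that incidents never overlap, which real data may not satisfy.

```python
from datetime import datetime

def mttr_minutes(incidents):
    """Mean time to resolve, in minutes, over (opened, resolved) pairs."""
    durations = [(resolved - opened).total_seconds() / 60 for opened, resolved in incidents]
    return sum(durations) / len(durations)

def availability(incidents, period_start, period_end):
    """Share of the period with no open incident (assumes incidents never overlap)."""
    downtime = sum((resolved - opened).total_seconds() for opened, resolved in incidents)
    return 1 - downtime / (period_end - period_start).total_seconds()

# Hypothetical incident windows for April 2026 (illustrative only).
incidents = [
    (datetime(2026, 4, 3, 9, 0), datetime(2026, 4, 3, 9, 30)),
    (datetime(2026, 4, 12, 22, 15), datetime(2026, 4, 12, 23, 0)),
    (datetime(2026, 4, 20, 14, 0), datetime(2026, 4, 20, 15, 45)),
]
period_start, period_end = datetime(2026, 4, 1), datetime(2026, 5, 1)

print(f"Incidents: {len(incidents)}")
print(f"MTTR: {mttr_minutes(incidents):.0f} minutes")
print(f"Availability: {availability(incidents, period_start, period_end):.3%}")
```

The same three numbers (incident count, MTTR, availability) are what a reliability dashboard would typically surface, whatever tool draws the charts.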

Project 33. OpenCost Kubernetes Cost Monitoring

Github link - https://github.com/opencost/opencost

Best fit MS Backgrounds - MS BA, MS DS, MS CS, MS IT

Resume-Ready Bullet Points -

Deployed OpenCost to track Kubernetes resource cost by namespace, workload, and service.
Built cost allocation reports for infrastructure usage analysis and chargeback-style reporting.
Identified cost optimization opportunities from underutilized workloads, targeting X% cost reduction.

Project 34. Cloud Carbon Footprint

Github link - https://github.com/cloud-carbon-footprint/cloud-carbon-footprint

Best fit MS Backgrounds - MS BA, MS DS, MS Sustainability, MS IT

Resume-Ready Bullet Points -

Built a cloud sustainability dashboard to estimate energy usage, carbon impact, and cloud cost trends.
Analyzed infrastructure usage patterns to identify high-impact workloads.
Created recommendations to reduce cloud waste and improve sustainability reporting by X%.

Project 35. Kepler Kubernetes Energy Monitoring

Github link - https://github.com/sustainable-computing-io/kepler

Best fit MS Backgrounds - MS DS, MS BA, MS CS, MS IT

Resume-Ready Bullet Points -

Deployed Kepler to collect Kubernetes energy metrics at node, pod, and container levels.
Built dashboards to analyze energy consumption, workload efficiency, and resource usage patterns.
Created insights for greener infrastructure and workload optimization across X services.

Project 36. kube-green Workload Energy Optimization

Github link - https://github.com/kube-green/kube-green

Best fit MS Backgrounds - MS BA, MS DS, MS CS

Resume-Ready Bullet Points -

Configured kube-green to reduce Kubernetes resource usage during non-working hours.
Measured workload sleep/wake behavior and estimated compute savings, energy savings, and cost reduction.
Documented sustainability and cost optimization outcomes across X workloads.
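
The savings estimate in these bullets is mostly arithmetic. The sketch below is a back-of-the-envelope version in Python; it does not use kube-green itself, and the working hours, replica counts, and price per CPU-hour are assumptions chosen only to show the calculation.

```python
def weekly_sleep_hours(workday_hours, workdays=5):
    """Hours per week a workload can sleep if it only runs during working hours."""
    return 7 * 24 - workday_hours * workdays

def estimated_savings(replicas, cpu_per_replica, sleep_hours, cost_per_cpu_hour):
    """Return (CPU-hours saved, estimated cost saved) for one workload per week."""
    cpu_hours_saved = replicas * cpu_per_replica * sleep_hours
    return cpu_hours_saved, cpu_hours_saved * cost_per_cpu_hour

# Hypothetical workload: 3 replicas requesting 0.5 CPU each, awake 12 hours on weekdays.
sleep_hours = weekly_sleep_hours(workday_hours=12)
cpu_hours, cost = estimated_savings(
    replicas=3, cpu_per_replica=0.5, sleep_hours=sleep_hours, cost_per_cpu_hour=0.03
)

print(f"Sleep hours per week: {sleep_hours}")
print(f"CPU-hours saved per week: {cpu_hours}")
print(f"Estimated weekly cost saved: ${cost:.2f}")
```

Treat the result as an upper bound: the money is only saved if the cluster actually scales nodes down or bills by requested resources.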

Project 37. Network UPS Tools Power Monitoring

Github link - https://github.com/networkupstools/nut

Best fit MS Backgrounds - MS IT, MS BA, MS DS

Resume-Ready Bullet Points -

Configured UPS monitoring to track power status, battery health, and outage events.
Built reports for backup power availability, power incidents, and battery performance trends.
Automated graceful shutdown logic to protect systems during power failure scenarios.

Project 38. Ceph Distributed Storage

Github link - https://github.com/ceph/ceph

Best fit MS Backgrounds - MS CS, MS IT, MS DS

Resume-Ready Bullet Points -

Deployed a Ceph storage cluster to understand distributed storage, replication, and fault tolerance.
Monitored storage capacity, disk health, cluster status, and recovery behavior.
Documented storage failure scenarios and recovery workflows across X storage nodes.

Project 39. Rook Ceph on Kubernetes

Github link - https://github.com/rook/rook

Best fit MS Backgrounds - MS CS, MS IT, MS DS

Resume-Ready Bullet Points -

Deployed Ceph storage on Kubernetes using Rook.
Created persistent storage for container workloads and tested failure recovery.
Monitored storage utilization, volume health, and availability across the Kubernetes cluster.

Project 40. Longhorn Kubernetes Storage

Github link - https://github.com/longhorn/longhorn

Best fit MS Backgrounds - MS CS, MS IT, MS DS

Resume-Ready Bullet Points -

Implemented Longhorn for persistent Kubernetes storage.
Configured volume replication, backup, and restore workflows.
Tested application recovery after simulated storage failure and measured recovery time.

Project 41. MinIO Object Storage

Github link - https://github.com/minio/minio

Best fit MS Backgrounds - MS DS, MS BA, MS CS, MS IT

Resume-Ready Bullet Points -

Built an S3-compatible object storage system using MinIO.
Stored logs, backup files, and analytics datasets in a self-hosted object storage layer.
Designed a storage workflow to support data pipelines, infrastructure logs, and backup use cases.

Project 42. Velero Kubernetes Backup & Disaster Recovery

Github link - https://github.com/vmware-tanzu/velero

Best fit MS Backgrounds - MS IT, MS CS, MS BA

Resume-Ready Bullet Points -

Configured Velero to back up and restore Kubernetes workloads.
Tested disaster recovery scenarios for namespaces, deployments, and persistent volumes.
Documented recovery time, backup success rate, and validation steps for workload restoration.

Project 43. TrueNAS Middleware Storage Project

Github link - https://github.com/truenas/middleware

Best fit MS Backgrounds - MS IT, MS CS, MS DS

Resume-Ready Bullet Points -

Explored NAS storage workflows using TrueNAS middleware concepts.
Analyzed storage pool usage, snapshots, datasets, and storage API behavior.
Built documentation for ZFS-based storage operations and storage health monitoring.

Project 44. Batfish Network Validation

Github link - https://github.com/batfish/batfish

Best fit MS Backgrounds - MS DS, MS CS, MS IT

Resume-Ready Bullet Points -

Used Batfish to analyze network configurations and detect routing or policy issues.
Built validation checks to identify misconfigurations before deployment.
Created reports summarizing network risk, configuration errors, and routing validation results.

Project 45. Nornir Network Automation

Github link - https://github.com/nornir-automation/nornir

Best fit MS Backgrounds - MS CS, MS IT, MS DS

Resume-Ready Bullet Points -

Built Python-based network automation tasks using Nornir.
Automated device inventory checks, configuration backups, and status collection.
Generated structured network data for analytics and reporting across X devices.

Project 46. NAPALM Multi-Vendor Network Automation

Github link - https://github.com/napalm-automation/napalm

Best fit MS Backgrounds - MS CS, MS IT, MS DS

Resume-Ready Bullet Points -

Automated multi-vendor network device checks using NAPALM.
Collected configuration, interface, and device state data for analysis.
Built validation scripts to compare expected vs actual network state across X devices.
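
The "expected vs actual" comparison can be sketched without any network gear. The Python below does not call NAPALM; the dictionaries are hypothetical and only loosely shaped like per-interface state you might collect from devices, so treat the field names as assumptions.

```python
def diff_state(expected, actual):
    """List (interface, field, expected, actual) for every mismatch."""
    mismatches = []
    for name, want in expected.items():
        have = actual.get(name)
        if have is None:
            # The whole interface is absent from the collected state.
            mismatches.append((name, "(missing)", want, None))
            continue
        for field, value in want.items():
            if have.get(field) != value:
                mismatches.append((name, field, value, have.get(field)))
    return mismatches

# Hypothetical intended state (source of truth) and collected state.
expected = {
    "Ethernet1": {"is_up": True, "mtu": 9000},
    "Ethernet2": {"is_up": True, "mtu": 9000},
    "Ethernet3": {"is_up": False, "mtu": 1500},
}
actual = {
    "Ethernet1": {"is_up": True, "mtu": 9000, "speed": 10000},
    "Ethernet2": {"is_up": False, "mtu": 1500, "speed": 10000},
}

for interface, field, want, have in diff_state(expected, actual):
    print(f"{interface} {field}: expected {want}, got {have}")
```

Only fields named in the expected state are checked, so extra data collected from the device (like speed here) does not raise false alarms.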

Project 47. FRRouting Data Center Routing Lab

Github link - https://github.com/FRRouting/frr

Best fit MS Backgrounds - MS CS, MS IT, MS DS

Resume-Ready Bullet Points -

Built a routing lab using FRRouting to practice BGP, OSPF, EVPN, and Linux networking.
Simulated data center routing scenarios and tested failover behavior.
Documented routing table changes, network convergence, and failover response time.

Project 48. SONiC Network OS Study Lab

Github link - https://github.com/sonic-net/SONiC

Best fit MS Backgrounds - MS CS, MS IT

Resume-Ready Bullet Points -

Explored SONiC architecture for cloud-scale data center switching.
Studied switch OS components, routing behavior, and network operations workflows.
Documented key concepts around BGP, switch management, network OS design, and data center switching.

Project 49. Wazuh Security Monitoring

Github link - https://github.com/wazuh/wazuh

Best fit MS Backgrounds - MS Cybersecurity, MS DS, MS BA, MS IT

Resume-Ready Bullet Points -

Deployed Wazuh to collect and analyze security events from servers.
Built dashboards for security alerts, compliance checks, endpoint activity, and incident trends.
Investigated log patterns to identify suspicious activity and improve infrastructure security visibility.

Project 50. Zeek Network Security Analytics

Github link - https://github.com/zeek/zeek

Best fit MS Backgrounds - MS Cybersecurity, MS DS, MS BA, MS IT

Resume-Ready Bullet Points -

Used Zeek to generate structured network security logs from traffic.
Analyzed connection logs, DNS logs, HTTP logs, and traffic patterns to identify suspicious behavior.
Built a network security analytics dashboard for incident investigation and anomaly detection.
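
As one example of analyzing connection logs, the Python sketch below parses a tab-separated log with a #fields header (the layout Zeek uses for its TSV logs) and counts how many distinct destination ports each source touched. The sample rows are invented and trimmed to a few conn.log columns, and the threshold of 4 ports is an arbitrary choice for illustration.

```python
# Made-up sample in Zeek's tab-separated layout, trimmed to a few conn.log columns.
SAMPLE_CONN_LOG = "\n".join([
    "#fields\tts\tuid\tid.orig_h\tid.orig_p\tid.resp_h\tid.resp_p\tproto",
    "1777000001.0\tC1\t10.0.0.5\t51000\t10.0.0.9\t22\ttcp",
    "1777000001.2\tC2\t10.0.0.5\t51001\t10.0.0.9\t23\ttcp",
    "1777000001.4\tC3\t10.0.0.5\t51002\t10.0.0.9\t80\ttcp",
    "1777000001.6\tC4\t10.0.0.5\t51003\t10.0.0.9\t443\ttcp",
    "1777000001.8\tC5\t10.0.0.5\t51004\t10.0.0.9\t3389\ttcp",
    "1777000050.0\tC6\t10.0.0.7\t52000\t10.0.0.9\t443\ttcp",
    "1777000090.0\tC7\t10.0.0.7\t52001\t10.0.0.9\t443\ttcp",
])

def parse_zeek_tsv(text):
    """Turn a Zeek-style TSV log into a list of dicts keyed by field name."""
    fields, rows = [], []
    for line in text.splitlines():
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]
        elif line and not line.startswith("#"):
            rows.append(dict(zip(fields, line.split("\t"))))
    return rows

def distinct_ports_per_source(rows):
    """Count distinct destination ports contacted by each source host."""
    ports = {}
    for row in rows:
        ports.setdefault(row["id.orig_h"], set()).add(row["id.resp_p"])
    return {source: len(seen) for source, seen in ports.items()}

PORT_THRESHOLD = 4  # assumed cutoff; tune against real traffic

port_counts = distinct_ports_per_source(parse_zeek_tsv(SAMPLE_CONN_LOG))
suspects = [source for source, count in port_counts.items() if count >= PORT_THRESHOLD]
print(port_counts)
print("Possible port scan from:", suspects)
```

A source hitting many different ports in a short window is a classic scanning pattern, but it is only a starting point for investigation, not proof of an attack.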

Final Note

You do not need to complete all 50 projects.

Pick 3 to 5 projects based on your target role.

For MS BA students, start with:

Grafana Infrastructure Dashboard
OpenCost Kubernetes Cost Monitoring
Cloud Carbon Footprint
Data Center Asset Analytics Dashboard
OpenSearch Dashboards BI Project

For MS DS students, start with:

Prometheus Metrics Monitoring
Grafana Loki Log Analytics
OpenTelemetry Collector Pipeline
Kepler Kubernetes Energy Monitoring
Zeek Network Security Analytics

For infrastructure and data center roles, start with:

NetBox
MAAS
Ansible
Kubespray
Prometheus
LibreNMS
Ceph
FRRouting

Again, one important reminder:

Resume bullet points are only for reference. Adjust them based on your actual implementation, your target role, your results, and the metrics you personally achieve.

How I’d Actually Become a Data Analyst in 2026

Karthik Adari — Sat, 04 Apr 2026 15:23:53 GMT

From zero to advanced, without wasting months learning random tools

Every week, I see people asking the same question:

“How do I become a data analyst?”

And most of the time, the answers are either too vague or too overwhelming.

Some people say, “Just learn SQL and Excel.”
Some say, “Do Python, Tableau, Power BI, statistics, machine learning, cloud, and AI.”
And then beginners end up doing a little bit of everything… and mastering nothing.

That’s the real problem.

The goal is not to collect tools.
The goal is to become someone who can look at messy data, figure out what matters, and explain it in a way that helps a business take action.

That’s what a good data analyst actually does.

So if I had to start again today, this is the roadmap I’d follow.

1) Start with spreadsheets first

I know spreadsheets don’t sound exciting.
But this is where most real business data still lives.

Before jumping into dashboards or Python, get genuinely comfortable with Excel or Google Sheets.

Learn:

formulas like SUM, AVERAGE, IF, COUNTIF, XLOOKUP
sorting and filtering
conditional formatting
pivot tables
basic charts
simple dashboards
manual data cleaning

This stage matters more than people think.

Because if you cannot take a messy spreadsheet and make it readable, advanced tools won’t magically fix that.

Your first goal should be simple:

Take raw data and turn it into something a non-technical person can understand in two minutes.

2) Learn basic statistics without overcomplicating it

A lot of beginners get scared when they hear “statistics,” but honestly, you do not need to become a statistician.

You just need enough to avoid drawing bad conclusions.

Focus on:

mean, median, mode
percentages and growth rates
variance and standard deviation
correlation vs causation
probability basics
distributions
confidence intervals
basic hypothesis testing intuition

Why this matters:

Because a lot of people can build charts.
Very few can tell whether the pattern in that chart is actually meaningful.

A data analyst should be able to answer questions like:

Is this change normal?
Is this trend real?
Is this just random noise?
Should the business care about this?

If you can answer those properly, you’re already ahead of a lot of people.
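
One rough way to approach "is this change normal?" is to compare a new value against the spread of recent history. The Python sketch below uses invented weekly order counts and a z-score with a cutoff of 2; the data and the cutoff are assumptions, and this is a rule of thumb rather than a formal test.

```python
from statistics import mean, stdev

def z_score(history, new_value):
    """How many standard deviations new_value sits from the historical mean."""
    return (new_value - mean(history)) / stdev(history)

# Hypothetical weekly order counts for the past 8 weeks (illustrative only).
weekly_orders = [510, 495, 520, 505, 498, 515, 502, 508]

for candidate in (512, 450):
    z = z_score(weekly_orders, candidate)
    # A cutoff of 2 is a common rule of thumb, not a guarantee.
    verdict = "worth a closer look" if abs(z) > 2 else "within normal variation"
    print(f"{candidate} orders: z = {z:.2f} -> {verdict}")
```

A week with 512 orders sits well inside the usual range, while 450 is far outside it. That difference is what separates noise from something the business should care about.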

3) Understand the business, not just the data

This is where many people get stuck.

They become “tool people” instead of “problem solvers.”

A company is not hiring you because you know how to use a dashboard tool.
They’re hiring you because they want someone who can help them understand:

why sales dropped
why users stopped converting
which channel is wasting money
what product change is actually working

So start learning business concepts early:

revenue
profit
margin
conversion rate
churn
retention
funnel analysis
cohort analysis
segmentation

The habit you want to build is this:

What is happening? Why is it happening? What should be done next?

That thinking is what makes an analyst valuable.
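
Several of the concepts listed above are simple ratios once you have the right inputs. Here is a minimal Python sketch with made-up monthly figures; the function names and numbers are illustrative, and companies often define churn and conversion slightly differently.

```python
def margin(revenue, cost):
    """Profit as a share of revenue."""
    return (revenue - cost) / revenue

def conversion_rate(conversions, visitors):
    """Share of visitors who completed the desired action."""
    return conversions / visitors

def churn_rate(customers_at_start, customers_lost):
    """Share of starting customers lost during the period."""
    return customers_lost / customers_at_start

# Hypothetical monthly figures (illustrative only).
revenue, cost = 120_000, 90_000
visitors, orders = 40_000, 1_200
customers_at_start, customers_lost = 2_000, 90

churn = churn_rate(customers_at_start, customers_lost)
print(f"Profit: {revenue - cost}")
print(f"Margin: {margin(revenue, cost):.1%}")
print(f"Conversion rate: {conversion_rate(orders, visitors):.1%}")
print(f"Churn: {churn:.1%}, retention: {1 - churn:.1%}")
```

The arithmetic is the easy part. The analyst's job is agreeing on the definitions and then asking why a number moved.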

4) Learn SQL properly

If spreadsheets are your starting point, SQL is your real entry ticket into data analytics.

This is one of the most important skills in the entire roadmap.

Start with:

SELECT
WHERE
ORDER BY
GROUP BY
HAVING
joins
subqueries
CTEs
CASE WHEN
date functions
window functions

Then push into more business-style use cases:

monthly revenue trends
retention analysis
ranking top customers
finding duplicate records
identifying inactive users
comparing categories across time

The mistake many people make is learning SQL like it’s a syntax exercise.

Don’t do that.

Learn SQL like you’re answering business questions.

That’s when it starts becoming useful.
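
To make "answering business questions" concrete, here is a small sketch that runs three of the use cases above against an in-memory SQLite database from Python. The orders table, its columns, and its rows are invented for the example, and the ranking query assumes a SQLite build with window function support (3.25 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer TEXT, order_date TEXT, amount REAL);
INSERT INTO orders VALUES
  (1, 'Asha',  '2026-01-05', 120.0),
  (2, 'Ben',   '2026-01-19',  80.0),
  (3, 'Asha',  '2026-02-02', 200.0),
  (4, 'Chloe', '2026-02-14',  50.0),
  (4, 'Chloe', '2026-02-14',  50.0),  -- accidental duplicate row
  (5, 'Ben',   '2026-03-01',  90.0);
""")

# Question 1: are there duplicate records? (GROUP BY + HAVING)
duplicates = conn.execute("""
    SELECT order_id, COUNT(*) AS copies
    FROM orders
    GROUP BY order_id, customer, order_date, amount
    HAVING COUNT(*) > 1
""").fetchall()

# Question 2: what is the monthly revenue trend? (CTE + date function)
monthly_revenue = conn.execute("""
    WITH clean AS (SELECT DISTINCT * FROM orders)
    SELECT strftime('%Y-%m', order_date) AS month, SUM(amount) AS revenue
    FROM clean
    GROUP BY month
    ORDER BY month
""").fetchall()

# Question 3: who are the top customers? (window function)
top_customers = conn.execute("""
    WITH clean AS (SELECT DISTINCT * FROM orders),
         totals AS (SELECT customer, SUM(amount) AS total FROM clean GROUP BY customer)
    SELECT customer, total, RANK() OVER (ORDER BY total DESC) AS revenue_rank
    FROM totals
    ORDER BY revenue_rank
""").fetchall()

print("Duplicates:", duplicates)
print("Monthly revenue:", monthly_revenue)
print("Top customers:", top_customers)
```

Notice that the duplicate check comes first. If the repeated row had been left in, February revenue would have been overstated.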

5) Learn visualization after you learn how to think

This is the stage where people start feeling like a “real analyst,” because now the data becomes visible.

Pick Tableau or Power BI and get comfortable with:

bar charts
line charts
scatter plots
heatmaps
maps
dashboard layout
filters
drill-downs
storytelling with data

But here’s the important part:

A dashboard is not supposed to look impressive.
It is supposed to make the answer obvious.

That’s the real standard.

A strong dashboard helps someone instantly see:

what changed
where the problem is
what needs attention
what decision should be taken next

That’s how you should build.

6) Learn Python when you’re ready to go beyond manual work

Once you’re comfortable with spreadsheets, SQL, and basic visualization, then Python starts making a lot more sense.

Because now you know why you’re using it.

Focus on:

pandas
numpy
matplotlib
basic EDA
cleaning missing values
merging datasets
grouping and aggregation
reading CSV and Excel files
automating repetitive analysis tasks

Python helps when:

the data is bigger
the cleaning is messier
the analysis needs repetition
you want more flexibility than spreadsheets can give

You do not need to become a software engineer here.

You just need to become the kind of analyst who can take messy data, clean it, analyze it, and explain what matters.

That alone is powerful.
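
Here is a minimal sketch of that clean, merge, and aggregate loop using pandas (assumed to be installed). The CSV contents are invented and held in memory so the example runs on its own; dropping rows with a missing amount is just one possible cleaning choice.

```python
import io
import pandas as pd

# Hypothetical CSV contents; in practice these would be files passed to pd.read_csv.
orders_csv = io.StringIO(
    "order_id,customer_id,amount\n"
    "1,101,120.0\n"
    "2,102,\n"
    "3,101,200.0\n"
    "4,103,50.0\n"
)
customers_csv = io.StringIO(
    "customer_id,segment\n"
    "101,Returning\n"
    "102,New\n"
    "103,New\n"
)

orders = pd.read_csv(orders_csv)
customers = pd.read_csv(customers_csv)

# Basic EDA: how many values are missing in each column?
print(orders.isna().sum())

# Cleaning: drop orders with no amount (imputing a value is another option).
orders_clean = orders.dropna(subset=["amount"])

# Merging: attach the customer segment to each order.
merged = orders_clean.merge(customers, on="customer_id", how="left")

# Grouping and aggregation: revenue and order count per segment.
summary = merged.groupby("segment")["amount"].agg(["sum", "count"])
print(summary)
```

Once this lives in a script, rerunning the analysis on next month's file is a one-line change, which is the real advantage over manual spreadsheet work.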

7) Get very good at data cleaning

This is probably the least glamorous skill in analytics.

And also one of the most important.

Real data is messy.

It has:

missing values
inconsistent names
wrong formats
duplicates
broken dates
weird text
bad joins
incomplete records

This is where a lot of analysis goes wrong.

Not because the model was bad.
Not because the dashboard was bad.
But because the data itself was never questioned.

A good analyst keeps asking:

Can I trust this dataset?
Is something missing?
Are these numbers even reasonable?
Are we comparing the right things?
Is the business logic aligned with the data logic?

That mindset matters more than any tool.
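
A few of those questions can be turned into quick automated checks that run before any analysis starts. The Python sketch below counts some common problems in a list of records; the field names, date format, and sample rows are assumptions made for the example.

```python
from datetime import datetime

def is_valid_date(text, fmt="%Y-%m-%d"):
    """True if text parses as a date in the expected format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except (TypeError, ValueError):
        return False

def quality_report(rows):
    """Count common data problems across a list of order records."""
    seen = set()
    report = {"missing_amount": 0, "duplicate_rows": 0, "bad_dates": 0, "negative_amounts": 0}
    for row in rows:
        key = tuple(sorted(row.items()))  # exact-match fingerprint of the row
        if key in seen:
            report["duplicate_rows"] += 1
        seen.add(key)
        if row["amount"] is None:
            report["missing_amount"] += 1
        elif row["amount"] < 0:
            report["negative_amounts"] += 1
        if not is_valid_date(row["order_date"]):
            report["bad_dates"] += 1
    return report

# Hypothetical records with deliberate problems (illustrative only).
rows = [
    {"order_id": 1, "order_date": "2026-01-05", "amount": 120.0},
    {"order_id": 2, "order_date": "2026-13-01", "amount": 80.0},   # month 13 does not exist
    {"order_id": 3, "order_date": "2026-02-02", "amount": None},   # missing value
    {"order_id": 4, "order_date": "2026-02-14", "amount": -50.0},  # suspicious negative
    {"order_id": 1, "order_date": "2026-01-05", "amount": 120.0},  # exact duplicate
]

print(quality_report(rows))
```

None of these checks says what the right fix is. A negative amount might be a refund, which is exactly where business logic has to meet data logic.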

8) Move into advanced analytics

Once your basics are strong, then it’s time to go beyond “what happened?”

Now you can start exploring:

A/B testing
regression basics
forecasting basics
retention analysis
cohort analysis
funnel analysis
segmentation
anomaly detection
time series thinking

This is the stage where your work becomes more strategic.

You move from:

describing the past

to:

explaining the present
testing ideas
predicting what may happen next

That shift is a big one.

And it’s also where analysts become much more valuable.
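
As one concrete example of "testing ideas", here is a rough sketch of a two-proportion z-test for an A/B test, written with only the Python standard library. The function name and the visitor and conversion counts are hypothetical, and the normal approximation it relies on assumes reasonably large samples.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two_sided_p) for the difference in conversion rate, B minus A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 200 of 2000 visitors convert on A, 250 of 2000 on B.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value only says the difference is unlikely to be noise; whether the lift is worth acting on is still a business call.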

9) Learn how data systems actually work

A strong analyst does not just query tables blindly.

They understand where the data comes from.

That means learning:

relational databases
primary keys and foreign keys
data warehouses
ETL / ELT basics
fact and dimension tables
star schema basics
how pipelines move data from source to dashboard

Later, it also helps to know tools like:

BigQuery
Snowflake
Redshift
dbt

You do not need to become a data engineer.

But you should absolutely understand the flow of data.

Because once you understand the system, your analysis becomes more reliable.
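
To make the fact and dimension idea concrete, here is a small sketch using Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`), columns, and values are hypothetical; a real warehouse would have more dimensions (date, customer, store) around the same fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: one row per product, holds descriptive attributes.
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    category   TEXT NOT NULL
);
-- Fact table: one row per sale, holds measures plus foreign keys to dimensions.
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
    sale_date  TEXT NOT NULL,
    amount     REAL NOT NULL
);
INSERT INTO dim_product VALUES (1, 'Pizza'), (2, 'Drinks');
INSERT INTO fact_sales VALUES
    (10, 1, '2025-01-01', 12.0),
    (11, 1, '2025-01-02', 15.0),
    (12, 2, '2025-01-02', 3.0);
""")

# A typical dashboard query: join the fact to a dimension, then aggregate.
revenue_by_category = conn.execute("""
    SELECT d.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_product d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(revenue_by_category)
```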

10) Communication is not optional

This is where average analysts and strong analysts start separating.

You can do excellent analysis and still get ignored if you cannot communicate it well.

So learn how to:

write clear summaries
explain insights simply
present recommendations
adapt your language for stakeholders
connect numbers to action

For example:

Instead of saying:

“Revenue increased 12% month-over-month.”

Say:

“Revenue grew 12% compared to last month, mainly driven by repeat customers in the top-performing category.”

That second version is better because it adds meaning.

That’s what people remember.
That’s what gets trusted.

11) Build projects that show business thinking

Projects are where all of this starts coming together.

Not random projects.
Not “I made a chart because I had a CSV.”
Real projects with a question, a process, and an outcome.

Good project ideas:

sales dashboard
customer churn analysis
marketing campaign performance analysis
retention or cohort analysis
SQL case study
Python EDA project
A/B testing case study

For every project, include:

the problem
the dataset
your cleaning process
your analysis
the insight
the recommendation
the final output

The best portfolios do not just show code.

They show judgment.

12) Prepare for interviews like an analyst, not a student

Interview prep is not just memorizing SQL questions.

It’s learning how to explain your work clearly.

Be ready to answer:

What problem were you solving?
What was messy in the data?
How did you clean it?
What did you find?
What recommendation did you make?
What would you improve if you had more time?

Also practice:

SQL interview questions
Excel case studies
dashboard walkthroughs
statistics basics
business case questions

A lot of people know enough to do the work.

But they struggle to explain it.

That’s why practice matters.

13) Learn to use AI as a tool, not a shortcut

This part matters even more now.

The strongest analysts today are not just “Excel + SQL” people.

They know how to use AI to work faster and think better.

That does not mean letting AI do everything.

It means using it well.

For example, AI can help you:

draft SQL queries faster
debug broken code
summarize large datasets
brainstorm KPIs
turn dashboard findings into first-draft business summaries
explore anomalies quickly

But the real job is still yours.

You still need to decide:

what question matters
whether the data is trustworthy
whether the output makes business sense
what action should be taken

So yes, learn AI tools.

But use them like an accelerator, not a crutch.

The order I’d follow

If I were guiding someone from scratch, I’d keep it simple:

Stage 1: Excel / Google Sheets + basic statistics + business understanding
Stage 2: SQL from beginner to advanced
Stage 3: Tableau or Power BI
Stage 4: Python for analytics
Stage 5: Data cleaning and real-world case studies
Stage 6: Advanced analytics
Stage 7: Data systems and warehouses
Stage 8: Portfolio + interview preparation

And the best way to learn is not:

“Finish one giant syllabus and then start building.”

It’s this:

learn a topic
do a small project
learn the next topic
improve the project
repeat

That loop works.

A few YouTube resources that are actually useful

Here are some solid starting points:

1. Alex The Analyst – Data Analyst Bootcamp
A practical full playlist that covers the core analyst stack.

2. Luke Barousse – SQL for Data Analytics
Very helpful if you want SQL taught in a clean, job-relevant way.

3. freeCodeCamp – Data Analysis with Python
A good beginner-friendly Python course for analysis.

4. Alex The Analyst – Excel Tutorials for Data Analysts
Useful for spreadsheet foundations, pivot tables, and cleaning.

GitHub projects and repos to explore

If you want hands-on practice, these are good places to start:

1. AlexTheAnalyst / PortfolioProjects

2. emily1618 / Data-Portfolio

3. DeviSuhithaChundru / Retail-Data-Analytics-Project-Python-SQL-Integration

4. lukebarousse / Int_SQL_Data_Analytics_Course

5. jordanlue / DataQuest-Guided-Projects

6. amlanmohanty1 / customer-trends-data-analysis-SQL-Python-PowerBI

Final thought

A lot of people spend months asking:

“Which tool should I learn next?”

A better question is:

“Can I take messy data, find something meaningful, and explain what should happen next?”

Because that is the real job.

Tools matter, yes.

But clear thinking, clean analysis, and strong communication matter more.

That’s what turns someone from “learning analytics” into actually becoming an analyst.

F1 Visa Slots Are Opening: Here’s the Full Legit Process From DS-160 to USTravelDocs

Karthik Adari — Mon, 30 Mar 2026 16:08:34 GMT

If you’re trying to book your F1 visa right now, don’t rush blindly.

A lot of students think the process is just:
fill DS-160 → book slot → attend interview.

But that is not the full picture.

There is a proper order, and if you do things in the wrong sequence, you can delay your application, enter incorrect details, or show up without the right documents.

So here is the full process in a simple, clear way.

Official links you should keep open

Student visa information:
https://travel.state.gov/content/travel/en/us-visas/study/student-visa.html

DS-160 form:
https://ceac.state.gov/genniv/

DS-160 FAQs:
https://travel.state.gov/content/travel/en/us-visas/visa-information-resources/forms/ds-160-online-nonimmigrant-visa-application/ds-160-faqs.html

SEVIS I-901 fee:
https://www.ice.gov/sevis/i901

SEVIS student information:
https://www.ice.gov/sevis/students

USTravelDocs:
https://www.ustraveldocs.com/

Visa wait times:
https://travel.state.gov/content/travel/en/us-visas/visa-information-resources/global-visa-wait-times.html

Visa status check:
https://ceac.state.gov/ceacstattracker/status.aspx

U.S. visas main page / embassy access:
https://travel.state.gov/content/travel/en/us-visas.html

First, understand the correct order

For most new F1 students, the process goes like this:

Get admitted to a SEVP-approved school → receive Form I-20 → pay SEVIS I-901 fee → complete DS-160 → create your visa profile and follow your embassy/USTravelDocs steps → schedule appointment → attend biometrics/interview if required → wait for passport return and visa decision.

The exact appointment and payment flow can vary by country, so always follow the instructions for your specific embassy or consulate.

Step 1: Get admitted and receive your Form I-20

Before you can apply for an F1 visa, you need to be accepted by a SEVP-approved school.

Once admitted, your school registers you in SEVIS and issues your Form I-20.

This document is one of the most important parts of the entire process.

Make sure:

your name matches your passport
your program details are correct
the start date is correct
the I-20 is signed where needed

If dependents are traveling with you, they need their own I-20s.

Step 2: Pay the SEVIS I-901 fee

After receiving the I-20, pay the SEVIS I-901 fee through the official SEVIS website.

Save the payment receipt immediately.

I strongly recommend keeping:

one PDF copy
one screenshot
one printed copy

You may need this during the visa process and interview preparation.

Step 3: Fill out the DS-160 carefully

Next, complete the DS-160 online.

This is your official nonimmigrant visa application form.

While filling it out, keep these ready:

passport
Form I-20
SEVIS ID from the I-20
school address
travel history if applicable
education/work details

Take your time here.

A lot of people make avoidable mistakes in:

passport number
SEVIS ID
university name
personal details
travel history
photo upload

After submitting the DS-160, download and print the confirmation page with the barcode.

That confirmation page is extremely important.

Also save your DS-160 application ID somewhere safe in case you need to retrieve the form later.

Step 4: Upload your visa photo properly

During the DS-160 process, you will usually upload a visa photo.

Make sure the photo follows the official U.S. visa photo requirements.

If the digital upload fails, you may need to carry a printed photo in the required format.

Do not ignore this part. Even small issues with the photo can create unnecessary problems.

Step 5: Create your visa profile and follow your embassy/USTravelDocs process

This is where many students get confused.

After DS-160, go to your specific U.S. embassy or consulate instructions and check whether your country uses USTravelDocs or another scheduling platform.

In many countries, USTravelDocs handles:

profile creation
fee instructions
appointment scheduling
passport tracking
pickup or delivery information

At this stage, do the following:

Open your embassy or consulate’s visa instructions
Create your visa profile
Enter the correct passport and DS-160 details
Follow the fee payment method shown for your location
Schedule your appointment(s)

Do not blindly follow random videos from another country. The process can differ by location.

Step 6: Check visa slot availability and book early

Visa slot availability changes by location, season, and demand.

That is why students should check early and be ready with all documents before trying to book.

Also remember this important point:

You may be able to get the visa well before your course starts, but new F1 students generally cannot enter the U.S. more than 30 days before the program start date on the I-20.

So plan your travel carefully.

Step 7: Be ready for biometrics or multiple appointments

Depending on your location, the process may involve:

biometrics
fingerprints
one appointment
two appointments
VAC plus interview
direct interview flow

This is exactly why country-specific instructions matter.

Before your appointment day, double-check:

location
appointment date
reporting time
document rules
whether electronics or bags are restricted

Step 8: Prepare your complete document folder

Your baseline folder should include:

valid passport
DS-160 confirmation page
visa fee receipt if applicable
printed visa photo if needed
signed Form I-20
SEVIS fee receipt
appointment confirmation page

You should also keep supporting documents ready, such as:

admission letter
transcripts
degree certificates
test scores if relevant
financial documents
scholarship or assistantship letters if any
sponsor documents if someone else is funding you

Even if every document is not always requested, it is much better to be overprepared than underprepared.

Step 9: Prepare for the visa interview properly

Your interview is not just about documents.

You should be able to clearly explain:

what you are going to study
why you chose that university
why you chose that program
how your education will be funded
what your academic background is
what your plans are after completing your studies

Your answers should be clear, honest, and direct.

Do not memorize robotic lines.

Know your own profile well.

Step 10: After the interview

After your interview, keep track of two things:

Your visa application status
Your passport return or pickup status

Depending on your location, USTravelDocs may help with passport tracking and delivery details.

Do not make irreversible travel plans until your passport is returned and your visa is actually issued.

Important travel rule

Even if your visa is approved, the visa itself does not guarantee entry.

Final admission into the United States is decided at the port of entry.

Also remember again:

new F1 students usually cannot enter the U.S. more than 30 days before the program start date listed on the I-20.

Biggest mistakes students should avoid

Here are some of the most common mistakes:

trying to start without the I-20
entering wrong DS-160 details
forgetting to save the DS-160 confirmation page
ignoring embassy-specific instructions
not saving SEVIS payment proof
showing up without financial documents
assuming one country’s process is the same everywhere
booking travel too early
not checking whether biometrics are separate
carrying incomplete or inconsistent documents

My practical checklist before booking a slot

Before trying to book a visa slot, make sure you have:

passport ready
Form I-20 ready
SEVIS fee paid
DS-160 submitted
DS-160 confirmation page saved
photo ready
embassy instructions open
visa profile created
financial documents organized
academic documents organized

If these are ready, your process becomes much smoother.

Final note

The safest rule in the entire F1 visa process is this:

Always follow the official instructions for your specific U.S. embassy or consulate.

20 SQL Interview Questions I’d Practice Before Any Data Interview

Karthik Adari — Fri, 27 Mar 2026 15:03:17 GMT

Whenever I look at SQL interview prep, I notice the same mistake again and again:

A lot of people keep collecting random questions, but they never really master the core patterns.

So in this post, I’m not giving fluff.
I’m sharing 20 standard SQL questions that I believe cover a huge part of what usually gets asked in data analyst, business analyst, BI, and SQL-heavy interview rounds.

I wrote the answers in a very clear way, so this can work both as a practice guide and as a quick revision post before interviews.

Let’s get into it.

1) What is the difference between `INNER JOIN` and `LEFT JOIN`?

My answer:

I think about it like this:

INNER JOIN only returns the matching rows from both tables.
LEFT JOIN returns all rows from the left table, and only the matching rows from the right table. If there is no match, I get NULL values from the right side.

Example:

If I have a customers table and an orders table:

With INNER JOIN, I only see customers who placed orders.
With LEFT JOIN, I see all customers, even the ones who never placed an order.

SELECT c.customer_id, c.customer_name, o.order_id
FROM customers c
INNER JOIN orders o
  ON c.customer_id = o.customer_id;

SELECT c.customer_id, c.customer_name, o.order_id
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id;

Important interview point:

A LEFT JOIN can accidentally behave like an INNER JOIN if I filter the right table inside the WHERE clause.

Bad example:

SELECT c.customer_id, o.order_id
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
WHERE o.order_id IS NOT NULL;

That removes the NULL rows, so now I’ve basically turned it into an inner join.
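
A quick way to see this on real rows is a throwaway SQLite database from Python. The sketch below uses a variation on the query above: it filters on `order_status` (a column assumed here, with made-up customers and orders) and compares putting that filter in WHERE against putting it in the ON clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_status TEXT);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
INSERT INTO orders VALUES (100, 1, 'Completed'), (101, 2, 'Canceled');
""")


def customer_ids(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]


# Filter on the right table in WHERE: customers without a matching row are dropped.
filtered_in_where = customer_ids("""
    SELECT c.customer_id
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_status = 'Completed'
    ORDER BY c.customer_id
""")

# Same filter moved into ON: every customer is kept, unmatched ones get NULLs.
filtered_in_on = customer_ids("""
    SELECT c.customer_id
    FROM customers c
    LEFT JOIN orders o
      ON c.customer_id = o.customer_id
     AND o.order_status = 'Completed'
    ORDER BY c.customer_id
""")

print(filtered_in_where, filtered_in_on)
```

If the goal is "all customers, with their completed orders if any", the condition belongs in ON; if the goal is "only customers with completed orders", an INNER JOIN says so more honestly.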

2) What is the difference between `WHERE` and `HAVING`?

My answer:

I use:

WHERE to filter rows before grouping
HAVING to filter groups after aggregation

Example:

If I want customers who placed more than 2 orders:

SELECT customer_id, COUNT(*) AS total_orders
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 2;

If I want to first look only at completed orders, I use WHERE before grouping:

SELECT customer_id, COUNT(*) AS total_orders
FROM orders
WHERE order_status = 'Completed'
GROUP BY customer_id
HAVING COUNT(*) > 2;

Simple memory trick:

WHERE filters raw rows
HAVING filters aggregated results

3) What is the difference between `COUNT(*)`, `COUNT(column)`, and `COUNT(DISTINCT column)`?

My answer:

This is one of those questions that sounds simple, but interviewers love it because many people answer it loosely.

COUNT(*) counts all rows
COUNT(column) counts only non-null values in that column
COUNT(DISTINCT column) counts unique non-null values

Example:

Suppose I have this data in employees:

id | department
1  | Sales
2  | Sales
3  | NULL
4  | HR

Then:

SELECT COUNT(*) FROM employees;              -- 4
SELECT COUNT(department) FROM employees;     -- 3
SELECT COUNT(DISTINCT department) FROM employees; -- 2

Interview tip:

I always mention null handling here, because that’s usually what they want to test.
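
The three counts can be checked against the sample table with a few lines of Python and an in-memory SQLite database. This is just a verification sketch; the table is recreated from the example data above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, department TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(1, "Sales"), (2, "Sales"), (3, None), (4, "HR")],  # None is stored as NULL
)

counts = conn.execute("""
    SELECT COUNT(*),                    -- every row
           COUNT(department),           -- rows where department is not NULL
           COUNT(DISTINCT department)   -- unique non-NULL departments
    FROM employees
""").fetchone()
print(counts)  # (4, 3, 2)
```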

4) How do I find the 2nd highest salary?

My answer:

There are multiple ways. The best answer depends on whether ties matter.

If I want the second distinct highest salary:

SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (
  SELECT MAX(salary)
  FROM employees
);

Better scalable version using `DENSE_RANK()`:

WITH ranked_salaries AS (
  SELECT salary,
         DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
  FROM employees
)
SELECT salary
FROM ranked_salaries
WHERE rnk = 2;

Why I like this answer:

It handles ties better and is easier to extend to 3rd, 4th, or nth highest salary.

5) What is the difference between `ROW_NUMBER()`, `RANK()`, and `DENSE_RANK()`?

My answer:

All three are window functions used for ranking, but they behave differently with ties.

Example data:

name | score
A    | 95
B    | 95
C    | 90

Behavior:

ROW_NUMBER() gives unique numbers no matter what
→ 1, 2, 3
RANK() gives the same rank to ties, but skips the next rank
→ 1, 1, 3
DENSE_RANK() gives the same rank to ties, but does not skip ranks
→ 1, 1, 2

Query:

SELECT name,
       score,
       ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
       RANK() OVER (ORDER BY score DESC) AS rank_num,
       DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank_num
FROM scores;

When I use each:

ROW_NUMBER() when I need exactly one row per group
RANK() when ranking competition-style positions
DENSE_RANK() when I care about distinct ranking levels
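
To watch the three functions disagree on the example data, here is a small sketch run through Python's `sqlite3` module (it needs a SQLite build with window functions, version 3.25 or later). One assumption is added: ROW_NUMBER() gets `name` as a tiebreaker, because without one the order of tied rows is not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("A", 95), ("B", 95), ("C", 90)],
)

ranked = conn.execute("""
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,
           RANK()       OVER (ORDER BY score DESC)       AS rank_num,
           DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rank_num
    FROM scores
    ORDER BY name
""").fetchall()

for row in ranked:
    print(row)
```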

6) Write a query using `GROUP BY` and `HAVING`

My answer:

A very common example is finding customers with at least 2 orders.

SELECT customer_id,
       COUNT(*) AS total_orders
FROM orders
GROUP BY customer_id
HAVING COUNT(*) >= 2;

If I want customers whose revenue is above 1000:

SELECT customer_id,
       SUM(order_amount) AS total_revenue
FROM orders
GROUP BY customer_id
HAVING SUM(order_amount) > 1000;

What I keep in mind:

Every non-aggregated column in the SELECT clause usually needs to appear in GROUP BY.

7) Why do joins create duplicate rows? How do I fix it?

My answer:

Joins create duplicate rows when the relationship is not one-to-one.

For example:

one customer can have many orders
one order can have many items

So when I join tables, rows multiply.

Example:

If one customer has 3 orders, joining customers and orders gives 3 rows for that customer.

How I fix it:

I don’t start with DISTINCT blindly. I first check:

What is the grain of each table?
Is the join one-to-many or many-to-many?
Am I missing a join condition?
Do I actually need aggregation before joining?

Common fixes:

aggregate first
use the correct join keys
deduplicate source rows
use ROW_NUMBER() to keep only the latest row if needed

Example:

WITH latest_orders AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
  FROM orders
)
SELECT customer_id, order_id, order_date
FROM latest_orders
WHERE rn = 1;
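
The "aggregate first" fix is easiest to appreciate with numbers. In this sketch (Python with in-memory SQLite), a hypothetical `tickets` table is joined alongside `orders`; both are one-to-many on `customer_id`, so a direct join multiplies rows and inflates revenue. All names and values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders  (order_id INTEGER, customer_id INTEGER, order_amount REAL);
CREATE TABLE tickets (ticket_id INTEGER, customer_id INTEGER);
INSERT INTO orders  VALUES (1, 7, 100.0), (2, 7, 50.0);
INSERT INTO tickets VALUES (1, 7), (2, 7), (3, 7);
""")

# Direct join: 2 orders x 3 tickets = 6 rows, so each order amount is counted 3 times.
inflated_revenue = conn.execute("""
    SELECT SUM(o.order_amount)
    FROM orders o
    JOIN tickets t ON t.customer_id = o.customer_id
""").fetchone()[0]

# Aggregate each table to one row per customer first, then join.
revenue, ticket_count = conn.execute("""
    WITH order_totals AS (
        SELECT customer_id, SUM(order_amount) AS revenue
        FROM orders
        GROUP BY customer_id
    ),
    ticket_totals AS (
        SELECT customer_id, COUNT(*) AS tickets
        FROM tickets
        GROUP BY customer_id
    )
    SELECT o.revenue, t.tickets
    FROM order_totals o
    JOIN ticket_totals t ON t.customer_id = o.customer_id
""").fetchone()

print(inflated_revenue, revenue, ticket_count)
```

Adding DISTINCT to the first query would not repair the sum, which is why checking the grain comes before reaching for DISTINCT.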

8) What is the difference between a CTE and a subquery?

My answer:

Both help me break down complex queries.

A subquery is written inside another query
A CTE (WITH clause) is like a temporary named result set that makes the logic easier to read

Subquery example:

SELECT employee_name, salary
FROM employees
WHERE salary > (
  SELECT AVG(salary)
  FROM employees
);

CTE version:

WITH avg_salary_cte AS (
  SELECT AVG(salary) AS avg_salary
  FROM employees
)
SELECT employee_name, salary
FROM employees
CROSS JOIN avg_salary_cte
WHERE employees.salary > avg_salary_cte.avg_salary;

When I use each:

I use subqueries for smaller, simpler logic
I use CTEs when the query gets longer and I want readability or multi-step logic

9) How do I calculate a running total in SQL?

My answer:

I use a window function with SUM().

SELECT order_date,
       sales_amount,
       SUM(sales_amount) OVER (
         ORDER BY order_date
       ) AS running_total
FROM sales;

Why this works:

The window function keeps adding values in sorted order.

If I need a running total by customer:

SELECT customer_id,
       order_date,
       sales_amount,
       SUM(sales_amount) OVER (
         PARTITION BY customer_id
         ORDER BY order_date
       ) AS customer_running_total
FROM sales;

Interview point:

I like this answer because it shows I understand window functions, not just basic aggregation.
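
One detail worth knowing: when ORDER BY is given without an explicit frame, the default frame includes all rows that tie on the ordering column, so two sales on the same date show the same running total. The sketch below (Python plus in-memory SQLite, with made-up sales) compares that default with an explicit ROWS frame. The extra `sales_amount` sort key is only there to make the row order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_date TEXT, sales_amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2025-01-01", 10),
        ("2025-01-02", 20),  # two sales on the same date
        ("2025-01-02", 30),
        ("2025-01-03", 40),
    ],
)

running = conn.execute("""
    SELECT order_date,
           sales_amount,
           -- Default frame: tied dates are summed together.
           SUM(sales_amount) OVER (ORDER BY order_date) AS default_frame,
           -- Explicit ROWS frame: strictly one row at a time.
           SUM(sales_amount) OVER (
               ORDER BY order_date, sales_amount
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS row_by_row
    FROM sales
    ORDER BY order_date, sales_amount
""").fetchall()

for row in running:
    print(row)
```

Both versions end on the same grand total; they differ only on the rows that share a date.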

10) How do I get the latest record for each user?

My answer:

This is one of the most common real interview questions.

I usually solve it with ROW_NUMBER().

WITH ranked_records AS (
  SELECT user_id,
         status,
         updated_at,
         ROW_NUMBER() OVER (
           PARTITION BY user_id
           ORDER BY updated_at DESC
         ) AS rn
  FROM user_status
)
SELECT user_id, status, updated_at
FROM ranked_records
WHERE rn = 1;

Why I like this:

It’s clear, scalable, and works well in most SQL dialects.

11) How do I write conditional aggregation?

My answer:

I use CASE WHEN inside aggregation functions.

Example:

Suppose I want total orders, completed orders, and canceled orders by month.

SELECT DATE_TRUNC('month', order_date) AS month,
       COUNT(*) AS total_orders,
       SUM(CASE WHEN order_status = 'Completed' THEN 1 ELSE 0 END) AS completed_orders,
       SUM(CASE WHEN order_status = 'Canceled' THEN 1 ELSE 0 END) AS canceled_orders
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;

Why this matters:

A lot of dashboard-style metrics come from conditional aggregation.

12) How do I use `LAG()` and `LEAD()`?

My answer:

I use them when I want to compare the current row with a previous or next row.

LAG() looks backward
LEAD() looks forward

Example:

Day-over-day sales change:

SELECT order_date,
       sales_amount,
       LAG(sales_amount) OVER (ORDER BY order_date) AS previous_day_sales,
       sales_amount - LAG(sales_amount) OVER (ORDER BY order_date) AS sales_change
FROM daily_sales;

Where this shows up:

month-over-month growth
previous login date
churn or reactivation analysis
comparing current vs prior state

13) How do I find customers who placed orders on consecutive days?

My answer:

There are a few ways, but the cleanest approach often uses window functions.

WITH ordered_dates AS (
  SELECT customer_id,
         order_date,
         LAG(order_date) OVER (
           PARTITION BY customer_id
           ORDER BY order_date
         ) AS prev_order_date
  FROM orders
)
SELECT customer_id, order_date, prev_order_date
FROM ordered_dates
WHERE order_date = prev_order_date + INTERVAL '1 day';

What this shows:

I can compare one event with the previous event for the same customer.
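
The `+ INTERVAL '1 day'` syntax above is PostgreSQL-style and varies by database. As a runnable sketch of the same pattern, here it is in SQLite through Python, where `date(value, '+1 day')` plays the role of the interval arithmetic. The order data is invented, and dates are assumed to be stored as ISO `YYYY-MM-DD` text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2025-03-01"), (1, "2025-03-02"), (1, "2025-03-05"),
        (2, "2025-03-01"), (2, "2025-03-03"),
    ],
)

consecutive = conn.execute("""
    WITH ordered_dates AS (
        SELECT customer_id,
               order_date,
               LAG(order_date) OVER (
                   PARTITION BY customer_id
                   ORDER BY order_date
               ) AS prev_order_date
        FROM orders
    )
    SELECT customer_id, order_date, prev_order_date
    FROM ordered_dates
    -- First order per customer has a NULL previous date and is filtered out here.
    WHERE order_date = date(prev_order_date, '+1 day')
""").fetchall()
print(consecutive)
```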

14) How do I find duplicate records?

My answer:

I first define what “duplicate” means from a business perspective.

For example, if duplicate orders mean same customer_id, product_id, and order_date, then:

SELECT customer_id,
       product_id,
       order_date,
       COUNT(*) AS duplicate_count
FROM orders
GROUP BY customer_id, product_id, order_date
HAVING COUNT(*) > 1;

If I need to keep only one row:

I use ROW_NUMBER():

WITH deduped AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id, product_id, order_date
           ORDER BY created_at DESC
         ) AS rn
  FROM orders
)
SELECT *
FROM deduped
WHERE rn = 1;

15) How do I return the top 3 highest-paid employees in each department?

My answer:

This is a classic “top N per group” problem.

WITH ranked_employees AS (
  SELECT employee_name,
         department,
         salary,
         DENSE_RANK() OVER (
           PARTITION BY department
           ORDER BY salary DESC
         ) AS rnk
  FROM employees
)
SELECT employee_name, department, salary
FROM ranked_employees
WHERE rnk <= 3;

Why I use `DENSE_RANK()`:

It handles ties better than ROW_NUMBER() if I want all employees tied within the top 3 salary levels.

16) Why can date filtering go wrong with timestamps?

My answer:

A lot of people use BETWEEN carelessly and miss rows.

For example, this can be risky:

WHERE order_timestamp BETWEEN '2025-01-01' AND '2025-01-31'

Because '2025-01-31' is usually read as midnight at the start of January 31, any order placed later that day falls outside the range and gets silently excluded.

Safer version:

WHERE order_timestamp >= '2025-01-01'
  AND order_timestamp < '2025-02-01'

Why I prefer this:

It makes the boundary condition much cleaner.
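
Here is a small sketch of the difference, run against in-memory SQLite from Python with three made-up orders. One caveat: SQLite stores these timestamps as text and compares them as strings, while databases with a native timestamp type cast the date literal to midnight. The mechanism differs, but an afternoon order on January 31 is dropped by BETWEEN in both cases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_timestamp TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2025-01-15 09:00:00"),
        (2, "2025-01-31 14:30:00"),  # afternoon of the last day of January
        (3, "2025-02-01 00:00:00"),  # already February
    ],
)


def order_ids(where_clause):
    """Return order ids matching the given WHERE clause, in id order."""
    sql = f"SELECT order_id FROM orders WHERE {where_clause} ORDER BY order_id"
    return [row[0] for row in conn.execute(sql)]


with_between = order_ids("order_timestamp BETWEEN '2025-01-01' AND '2025-01-31'")
half_open = order_ids(
    "order_timestamp >= '2025-01-01' AND order_timestamp < '2025-02-01'"
)
print(with_between, half_open)
```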

17) How do I handle `NULL` values in SQL?

My answer:

I always remember that NULL means unknown or missing, and it behaves differently from regular values.

Important points:

COUNT(column) ignores nulls
COUNT(*) does not
NULL = NULL is not true
I need IS NULL or IS NOT NULL

Example:

SELECT *
FROM employees
WHERE manager_id IS NULL;

Using `COALESCE()`:

If I want a fallback value:

SELECT employee_name,
       COALESCE(bonus, 0) AS bonus_amount
FROM employees;

Interview point:

I make sure not to say null is equal to zero or blank. That’s a common mistake.
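
These rules are easy to confirm in a scratch database. The sketch below uses Python's `sqlite3` module with two made-up employees; SQL NULL comes back to Python as `None`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULL = NULL evaluates to NULL (unknown), not true; IS NULL is the real test.
null_equals_null, null_is_null, fallback = conn.execute(
    "SELECT NULL = NULL, NULL IS NULL, COALESCE(NULL, 0)"
).fetchone()

conn.execute("CREATE TABLE employees (employee_name TEXT, manager_id INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Dana", None), ("Eli", 1)],
)

# Comparing with = NULL matches nothing, even for rows where manager_id is NULL.
with_equals = conn.execute(
    "SELECT employee_name FROM employees WHERE manager_id = NULL"
).fetchall()
with_is_null = conn.execute(
    "SELECT employee_name FROM employees WHERE manager_id IS NULL"
).fetchall()

print(null_equals_null, null_is_null, fallback, with_equals, with_is_null)
```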

18) How do I find median salary by department?

My answer:

This depends on SQL dialect.

If the database supports percentile functions:

SELECT department,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees
GROUP BY department;

If not:

I explain the manual approach:

sort salaries within each department
assign row numbers
pick the middle row for odd counts
average the two middle rows for even counts

What the interviewer usually wants:

They want to see whether I understand the logic, even if the exact syntax changes by database.
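
The manual steps above can be sketched as a single query. This version runs in SQLite through Python, with made-up salaries, and leans on one assumption: `/` between integers performs integer division, which is true in SQLite but not in every dialect. With that, `(cnt + 1) / 2` and `(cnt + 2) / 2` point at the same middle row for odd counts and at the two middle rows for even counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [
        ("HR", 40), ("HR", 50), ("HR", 90),                          # odd count
        ("Sales", 30), ("Sales", 50), ("Sales", 70), ("Sales", 100),  # even count
    ],
)

medians = conn.execute("""
    WITH ranked AS (
        SELECT department,
               salary,
               ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary) AS rn,
               COUNT(*)     OVER (PARTITION BY department)                 AS cnt
        FROM employees
    )
    SELECT department, AVG(salary) AS median_salary
    FROM ranked
    WHERE rn IN ((cnt + 1) / 2, (cnt + 2) / 2)  -- middle row(s) of each department
    GROUP BY department
    ORDER BY department
""").fetchall()
print(medians)
```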

19) How do I optimize a slow SQL query?

My answer:

I usually talk through a structured process instead of guessing.

My checklist:

Check the execution plan
See which table is scanning too many rows
Confirm indexes on join and filter columns
Filter early when possible
Avoid selecting unnecessary columns
Check for expensive joins or many-to-many joins
Aggregate before joining if it reduces data
Remove unnecessary nested logic or repeated calculations

Example answer in interviews:

If a query is slow, I don’t jump straight to rewriting it. I first check where the cost is coming from, then I look at indexes, joins, filters, and row explosion.

That usually sounds much stronger than giving random optimization buzzwords.

20) “Sales dropped in one region.” How would I investigate it using SQL?

My answer:

I would break the problem step by step instead of jumping to a conclusion.

My approach:

Confirm the time period of the drop
Compare the affected region with other regions
Break sales into:
- number of orders
- number of customers
- average order value
- conversion rate
Drill down by:
- product category
- channel
- customer segment
- city or store
- date or week
Check whether:
- traffic dropped
- conversion dropped
- pricing changed
- cancellations increased
- one major product underperformed

Example SQL direction:

SELECT region,
       DATE_TRUNC('week', order_date) AS week,
       COUNT(DISTINCT order_id) AS total_orders,
       SUM(sales_amount) AS total_sales,
       SUM(sales_amount) * 1.0 / COUNT(DISTINCT order_id) AS avg_order_value
FROM sales
GROUP BY region, DATE_TRUNC('week', order_date)
ORDER BY week, region;

Then I would keep slicing until I find the main driver.

Why this is a strong answer:

Because business SQL interviews are often less about one fancy query and more about whether I can investigate in a structured way.

How I’d Actually Practice These

If I were preparing seriously, I wouldn’t just read these once.

I’d do this:

write each query from memory
explain each answer out loud in simple words
practice with small sample tables
then solve variations of the same pattern

That’s what usually builds real interview confidence.

Not random memorization.
Pattern recognition.

Final Thought

Whenever I look at SQL interviews, I keep coming back to one thing:

The questions may look different on the surface, but the underlying patterns are often the same.

That’s why I’d rather master these 20 really well than try to collect 200 random questions without depth.

If I can confidently handle joins, grouping, ranking, window functions, duplicates, nulls, and business investigation questions, I’m already in a much stronger position for most SQL-heavy interviews.

Zero to AI Engineer in 90 Days

Karthik Adari — Tue, 10 Mar 2026 23:05:23 GMT

If I were starting from scratch today, this is the roadmap I’d follow

If I had to start from zero today and aim for an AI Engineer role, I would not begin with random AI tools or only prompt engineering.

I would build around the skills that keep showing up in real roles: Python, SQL, machine learning, data prep, evaluation, deployment, inference, and production systems. Current openings from Google, Amazon, and OpenAI still point strongly in that direction, with repeated emphasis on Python/SQL, ML modeling, evaluation, deployment, scalability, and getting systems into production.

So if I were creating a roadmap today, my focus would be simple:

Don’t just learn AI. Learn how to build and ship it.

One honest note: 90 days is enough to become strong and portfolio-ready, not enough to master everything. The goal is to build momentum fast, create real projects, and become someone who can actually solve problems with AI.

Days 1–14: I would build the foundation first

Before touching deep learning, LLMs, or agents, I’d make sure I can code properly and work with data.

In the first two weeks, I would focus on:

Python
Git and GitHub
SQL
Pandas / NumPy basics
math intuition for ML, especially linear algebra and statistics

For Python, I’d start with Kaggle Learn: Python if I wanted something short and interactive, and pair it with a freeCodeCamp Python full course on YouTube if I wanted a longer video-based walkthrough. Kaggle’s Learn portal also offers free tracks for Pandas and other data skills. (Kaggle)

For SQL, I’d use Kaggle Learn: Intro to SQL for a quick practical start, and then a freeCodeCamp SQL full course on YouTube for a longer video format. (Kaggle)

For math, I’d use Khan Academy’s Linear Algebra and Statistics & Probability tracks, because they’re free and beginner-friendly.

By the end of these 14 days, I’d want to be comfortable:

writing small Python programs on my own
reading and cleaning data
pushing code to GitHub
writing basic SQL queries
understanding concepts like vectors, matrices, averages, variance, and probability

That foundation matters a lot more than people think.

Days 15–30: I would learn machine learning properly

Once the basics are in place, I’d move into classical machine learning.

This is where I’d learn:

supervised vs unsupervised learning
regression vs classification
train / validation / test split
overfitting and underfitting
feature engineering
model evaluation
error analysis
model comparison

My main free resource here would be Google’s Machine Learning Crash Course, which Google describes as a fast-paced, practical introduction with videos, visualizations, and hands-on exercises. I’d pair that with Kaggle’s Intro to Machine Learning and Intermediate Machine Learning for practice. (Google for Developers)

For conceptual clarity, I’d also use StatQuest on YouTube, especially for the confusion matrix, the bias-variance trade-off, evaluation metrics, cross-validation, and model intuition. (Google for Developers)

This is also the stage where I would build my first ML project:

spam detection
customer churn prediction
house price prediction
loan default prediction

Not just the notebook. I’d also explain:

what problem I solved
what data I used
which models I tried
what metric I chose
what mistakes the model still makes

That habit is what starts turning learning into engineering.

Days 31–45: I would learn deep learning with PyTorch

Once I understand classical ML, I’d move into deep learning.

Here I’d focus on:

tensors
datasets and dataloaders
neural networks
loss functions
optimizers
training loops
transfer learning

For this phase, I’d start with the PyTorch official beginner tutorials, especially Learn the Basics, because they walk through a complete ML workflow in PyTorch. I’d pair that with a freeCodeCamp PyTorch course on YouTube if I wanted a longer guided walkthrough. If I wanted an extra mini-course, Kaggle’s Intro to Deep Learning is also free. (PyTorch Docs)

By the end of this phase, I’d build one deep learning project like:

image classifier
sentiment classifier
document classifier
resume classifier

The goal here wouldn’t be fancy research. It would be understanding how a model trains, how loss changes, how to detect overfitting, and how to evaluate results properly.

Days 46–60: I would learn LLMs, transformers, and RAG

This is where the roadmap becomes modern AI engineering.

At this stage, I’d focus on:

transformers
tokenization
embeddings
prompting
vector search
retrieval-augmented generation
hallucination analysis
evaluation for LLM apps
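Embeddings and vector search sound abstract until you see them in a few lines. This is a toy sketch, not how a real vector database works internally: the three-dimensional "embeddings", the document names, and both function names are made up for illustration. Real embedding models produce vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the ids of the k documents most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(reverse=True)                  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical embeddings for three documents.
index = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.1],
    "careers_page": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]    # pretend embedding of "how do I get my money back?"
results = top_k(query, index, k=2)
print(results)             # ['refund_policy', 'shipping_times']
```

In a RAG app, the retrieved documents are what gets pasted into the prompt, so checking whether the right document lands in the top k is the first thing I'd test.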

My main free resource here would be the Hugging Face LLM Course, which is fully free and teaches large language models and NLP using the Hugging Face ecosystem. (Hugging Face)

For RAG, I’d use freeCodeCamp’s RAG & MCP Fundamentals or RAG Fundamentals and Advanced Techniques, both free video-based resources aimed at helping learners understand document embeddings, vector databases, and building retrieval systems. (FreeCodeCamp)

If I wanted to understand how LLMs work under the hood, I’d also use freeCodeCamp’s “Code an LLM From Scratch” or a long-form coding workshop that walks through implementing LLM ideas directly. (FreeCodeCamp)

At this stage, I would build my first real AI application:

PDF Q&A assistant
resume review assistant
interview prep assistant
company research assistant

And I would test it seriously:

when retrieval fails
when answers become ungrounded
when hallucinations appear
how I can improve relevance and latency

That mindset is much closer to real AI engineering than simply saying, “I built a chatbot.”


Days 61–75: I would learn APIs, deployment, agents, and MLOps

This is the part many beginners skip, but this is often where the “engineer” part actually begins.

At this stage, I’d focus on:

FastAPI
Docker
model serving
experiment tracking
monitoring basics
agent workflows

For APIs, I’d use the FastAPI official tutorial, which the project itself describes as the official and recommended way to learn FastAPI. For Docker, I’d use Docker Get Started. Both are free. (FastAPI)

For MLOps basics, I’d use freeCodeCamp’s MLflow + Databricks MLOps course and supplement it with the MLflow getting-started docs if I needed reference material. (FreeCodeCamp)

For agents, I’d use the Hugging Face Agents Course, which explicitly states that it is a free course for understanding, using, and building AI agents. (Hugging Face)

For production thinking, I’d also use Full Stack Deep Learning and Made With ML, both of which provide free materials focused on building production-grade ML/AI systems. (Full Stack Deep Learning)

By the end of this phase, I would take one earlier project and turn it into something more serious:

expose it through an API
Dockerize it
add experiment tracking
write a clean README
document limitations and failure cases

That alone teaches a huge amount.

Days 76–90: I would build one flagship capstone

Now I would stop collecting tutorials and start shipping something real.

This last phase would be about building one project that brings everything together:

data ingestion
model or LLM logic
evaluation
API serving
deployment
documentation
debugging

I’d choose one path based on interest.

If I wanted to become an LLM / GenAI Engineer

I’d go deeper into:

RAG pipelines
agent systems
evals
retrieval optimization
latency and cost trade-offs

My free stack would be:

Hugging Face LLM Course
Hugging Face Agents Course
freeCodeCamp RAG courses
Full Stack Deep Learning (Hugging Face)

If I wanted to become a Computer Vision AI Engineer

I’d go deeper into:

CNNs
transfer learning
image classification
object detection basics
vision embeddings

My free stack would be:

PyTorch tutorials
Kaggle Intro to Deep Learning
additional freeCodeCamp deep learning / PyTorch videos (PyTorch Docs)

If I wanted to become an Applied ML / MLOps Engineer

I’d go deeper into:

training pipelines
experiment tracking
deployment
monitoring
ML system design

My free stack would be:

Google ML Crash Course
MLflow resources
Full Stack Deep Learning
Made With ML (Google for Developers)

The three projects I’d want by Day 90

If I really wanted to be job-ready, I would want these three finished:

One classical ML project
One deep learning project
One deployed AI application using LLMs, RAG, or agents

Why these three?

Because together they show:

I understand ML fundamentals
I can work with deep learning tools
I can build and ship real AI systems

And that lines up well with what current roles are asking for: scalable ML solutions, working with large datasets, evaluation, deployment, production readiness, and end-to-end execution.


The free resources I’d personally keep bookmarked

These are the ones I’d keep open throughout the journey:

Kaggle Learn: Python (Kaggle)
freeCodeCamp Python full course on YouTube (YouTube)
Kaggle Learn: Intro to SQL (Kaggle)
freeCodeCamp SQL full course on YouTube (YouTube)
Khan Academy: Linear Algebra / Statistics (Google for Developers)
Google Machine Learning Crash Course (Google for Developers)
Kaggle: Intro to ML / Intermediate ML / Intro to Deep Learning (Class Central)
PyTorch Tutorials (PyTorch Docs)
Hugging Face LLM Course (Hugging Face)
freeCodeCamp RAG courses (FreeCodeCamp)
FastAPI official tutorial (FastAPI)
Docker Get Started (Docker)
Hugging Face Agents Course (Hugging Face)
Full Stack Deep Learning (Full Stack Deep Learning)
freeCodeCamp MLflow / MLOps course (FreeCodeCamp)

If I were starting from zero today, I would not try to learn all of AI at once.

I would focus on coding, ML foundations, deep learning, LLM apps, deployment, and real projects - because that’s what actually moves me closer to becoming an AI Engineer.

START Framework: The Structured Job Search Blueprint (2026 Edition)

Karthik Adari — Mon, 23 Feb 2026 15:38:48 GMT

When you commented START, you weren’t asking for motivation.

You were asking for direction.

So here it is.

This is the exact framework I would follow if I had to restart my job search today with zero advantage.

No referrals.
No brand name company.
No shortcuts.

Just strategy.

S - Select One Role (60-Day Focus Rule)

The biggest mistake I see?

People apply to 5 roles at once.

Data Analyst.
Data Engineer.
ML Engineer.
Business Analyst.
Product Analyst.

Each role has:

Different resume keywords
Different project expectations
Different interview patterns

Pick ONE role for 60 days.

Commit.

Clarity increases response rate more than volume ever will.

T - Track Market Signals (Reverse Engineer Demand)

Open 20 recent job descriptions.

Create a simple sheet:

Skill | Frequency | Mandatory or Preferred
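If you'd rather not tally by hand, the frequency column of that sheet can be sketched in a few lines of Python. The skill watch-list and the three job-description snippets below are invented examples; paste in your own 20 descriptions.

```python
from collections import Counter

# Hypothetical watch-list and job-description snippets.
SKILLS = ["sql", "power bi", "tableau", "python", "excel"]
job_descriptions = [
    "Strong SQL and Power BI required. Python preferred.",
    "Must have SQL and Excel; Tableau is a plus.",
    "SQL, Python and Tableau required.",
]

def skill_frequency(descriptions, skills):
    """Count in how many descriptions each skill is mentioned."""
    counts = Counter()
    for text in descriptions:
        lowered = text.lower()
        for skill in skills:
            if skill in lowered:       # naive substring match, fine for a first pass
                counts[skill] += 1
    return counts

freq = skill_frequency(job_descriptions, SKILLS)
for skill, n in freq.most_common():
    print(f"{skill:10} {n}/{len(job_descriptions)}")
```

The "Mandatory or Preferred" column still needs a human read, because that depends on the wording around each skill.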

You’ll quickly notice:

For Data Analyst roles:

SQL appears almost everywhere
Visualization tools matter
Stakeholder communication is often hidden but critical

The market leaves clues.

Most people don’t collect them.

If you want to speed this up, tools like job aggregators (including FoxHunt AI) help filter active listings quickly instead of scrolling reposted jobs. But even manually, this step is non-negotiable.

A - Assemble 2 Targeted Projects

Not 10 small projects.

Two strong, business-aligned ones.

Project 1: Revenue / Operations / Growth problem
Project 2: Automation or Efficiency problem

Each project should include:

Clear problem statement
Data cleaning explanation
Metrics before vs after
Visual dashboard
Hosted demo link

Recruiters don’t care about how many notebooks you have.

They care whether you can solve a business problem.

R - Refine Resume Around One Identity

Your resume should answer one question:

“What problem does this candidate solve?”

Most resumes fail because they try to impress everyone.

Strong resumes position you clearly for one role.

Checklist:

Remove unrelated skills
Add numbers (at least 60–70% quantified bullets)
Keep bullet length between 10–30 words
Use tools mentioned in job descriptions
Remove vague buzzwords
Avoid responsibility-based phrases like “Responsible for”

Your resume should feel intentional, not crowded.

When I was actively applying, I used this structured resume framework for my own job hunt:

Resume

It’s built around the exact principles above:

Role-specific alignment
Keyword optimization
Clean formatting
Strong quantified bullets

But remember - tools only amplify clarity.
They don’t replace it.

Your story still matters more than any template.

T - Target Applications Strategically

Instead of:
200 random applications

Do:
30 high-quality, early applications.

Apply within 24 hours of posting.

Why?

Because recruiters review in batches.
Late applications often get buried.

Speed + relevance > mass applying.

Bonus Layer: Outreach That Doesn’t Annoy

Bad message:
“Can you refer me?”

Better message:
“I noticed you’re a Data Analyst at X. I’m preparing for similar roles and curious what tools your team uses most daily.”

Curiosity creates conversation.
Conversation creates opportunity.

What We’re Building Here

This Substack is not about motivation.

It’s about systems.

In the next issues, I’ll break down:

Exact project templates by role
Resume bullet transformation examples
Outreach scripts that get responses
Interview question patterns by company type
How to track applications like a sales pipeline

And occasionally, I’ll share tools and systems I’m building that make this process faster and more structured.

But the strategy always comes first.

Capgemini Data Analyst (L1) – Complete Interview Questions & Answers

Karthik Adari — Thu, 19 Feb 2026 17:09:23 GMT

He told me something interesting:

“They didn’t test advanced stuff.
They tested whether I understand the basics clearly.”

That’s the pattern.

If your fundamentals are strong and you can explain your thinking step by step, you’re already ahead.

Here are the exact questions + complete answers explained simply.

1. INNER JOIN vs LEFT JOIN

INNER JOIN

Returns only matching records from both tables.

SELECT e.name, d.department
FROM employees e
INNER JOIN departments d
ON e.dept_id = d.id;

Only employees who have matching department IDs will appear.

LEFT JOIN

Returns all records from the left table + matching records from right table.

SELECT e.name, d.department
FROM employees e
LEFT JOIN departments d
ON e.dept_id = d.id;

Employees without a department will still appear (department = NULL).
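If you want to see the difference rather than memorize it, here is a small sketch that runs both queries against an in-memory SQLite database through Python's sqlite3 module. The sample names and departments are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, department TEXT);
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    INSERT INTO departments VALUES (1, 'Finance'), (2, 'Sales');
    INSERT INTO employees VALUES ('Asha', 1), ('Ravi', 2), ('Meera', NULL);
""")

inner_rows = conn.execute("""
    SELECT e.name, d.department
    FROM employees e
    INNER JOIN departments d ON e.dept_id = d.id
    ORDER BY e.name
""").fetchall()

left_rows = conn.execute("""
    SELECT e.name, d.department
    FROM employees e
    LEFT JOIN departments d ON e.dept_id = d.id
    ORDER BY e.name
""").fetchall()

print(inner_rows)  # Meera is missing: she has no matching department
print(left_rows)   # Meera appears with department = None (SQL NULL)
```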

👉 Interview Tip: Always explain with a real business example.

2. WHERE vs HAVING

WHERE

Filters rows before grouping.

SELECT * 
FROM sales
WHERE region = 'East';

HAVING

Filters after GROUP BY.

SELECT region, SUM(revenue)
FROM sales
GROUP BY region
HAVING SUM(revenue) > 10000;

👉 Rule:
WHERE → rows
HAVING → aggregated results

3. Find Duplicate Records

SELECT name, COUNT(*)
FROM customers
GROUP BY name
HAVING COUNT(*) > 1;

This lists every name that appears more than once, along with how many times it appears.

4. Remove Duplicates (Keep One)

Using ROW_NUMBER():

DELETE FROM customers
WHERE id IN (
  SELECT id FROM (
    SELECT id,
           ROW_NUMBER() OVER(PARTITION BY name ORDER BY id) AS rn
    FROM customers
  ) t
  WHERE rn > 1
);

Keep rn = 1, delete others.
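To check the delete behaves as described, here is a quick sketch against an in-memory SQLite database. It assumes a SQLite build with window-function support (3.25 or newer, which recent Python versions ship with); the sample names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customers (name)
    VALUES ('Asha'), ('Ravi'), ('Asha'), ('Asha'), ('Meera');
""")

# Number the rows within each name; everything after the first is a duplicate.
conn.execute("""
    DELETE FROM customers
    WHERE id IN (
      SELECT id FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY name ORDER BY id) AS rn
        FROM customers
      ) t
      WHERE rn > 1
    )
""")

remaining = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
print(remaining)  # [(1, 'Asha'), (2, 'Ravi'), (5, 'Meera')]
```

Ordering by id inside the window means the oldest row for each name is the one that survives.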

5. Second Highest Salary

Basic approach:

SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;

Using DENSE_RANK:

SELECT DISTINCT salary
FROM (
  SELECT salary,
         DENSE_RANK() OVER (ORDER BY salary DESC) rnk
  FROM employees
) t
WHERE rnk = 2;

👉 DENSE_RANK handles ties properly.
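Here is a small sketch that runs both approaches on data with ties, using an in-memory SQLite database (window functions need SQLite 3.25 or newer). The sample salaries are invented, and the DISTINCT in the outer DENSE_RANK query is my addition so that employees tied on the second-highest salary collapse into a single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
      ('A', 90000), ('B', 90000), ('C', 80000), ('D', 80000), ('E', 70000);
""")

limit_offset = conn.execute("""
    SELECT DISTINCT salary
    FROM employees
    ORDER BY salary DESC
    LIMIT 1 OFFSET 1
""").fetchall()

dense_rank = conn.execute("""
    SELECT DISTINCT salary
    FROM (
      SELECT salary,
             DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
      FROM employees
    ) t
    WHERE rnk = 2
""").fetchall()

print(limit_offset, dense_rank)  # [(80000,)] [(80000,)]
```

Both return 80000 even though two people share the top salary and two share the second one.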

6. COUNT(*) vs COUNT(column)

COUNT(*) → counts all rows
COUNT(column) → ignores NULL values

Example:

If 5 rows exist and 2 salary values are NULL:

COUNT(*) = 5
COUNT(salary) = 3
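The same example can be reproduced in a few lines with an in-memory SQLite database; the table contents below are made up to match the 5-rows, 2-NULLs case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
      ('A', 50000), ('B', NULL), ('C', 62000), ('D', NULL), ('E', 48000);
""")

total_rows, non_null_salaries = conn.execute(
    "SELECT COUNT(*), COUNT(salary) FROM employees"
).fetchone()

print(total_rows, non_null_salaries)  # 5 3
```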

7. GROUP BY Basics

Used with aggregate functions.

SELECT department, SUM(salary)
FROM employees
GROUP BY department;

Common mistake:
Selecting a column not included in GROUP BY without aggregation.

8. Primary Key vs Foreign Key

Primary Key:

Unique
Cannot be NULL
Identifies a row

Foreign Key:

Links to primary key in another table
Maintains relationship

Business Example:
Customer table → Orders table

9. Normalization (1NF, 2NF, 3NF)

1NF:

No repeating groups
Atomic values

2NF:

Remove partial dependency

3NF:

Remove transitive dependency

Purpose:

Avoid redundancy
Improve data consistency

10. Excel: VLOOKUP vs XLOOKUP

VLOOKUP:

Searches left to right only
Needs column number

XLOOKUP:

More flexible
Works both directions
Handles missing matches with a built-in "if not found" argument

Example:

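=XLOOKUP(D2, A:A, B:B, "Not Found")

This looks up the value in D2 within column A and returns the matching value from column B, or "Not Found" if there is no match.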

11. When to Use Pivot Table?

To:

Summarize large data
Calculate totals by category
Create quick KPI reports

Example:
Sales by Region → Drag Region to Rows, Revenue to Values.

12. Power BI: Measure vs Calculated Column

Calculated Column:

Computed row by row
Stored in model

Measure:

Calculated dynamically
Based on filter context

Example:

Measure:

Total Sales = SUM(Sales[Amount])

Use Measure for KPIs.

13. Handling Missing Values

Options:

Remove rows
Replace with mean/median
Replace using business rule
Keep NULL (if meaningful)

Always explain WHY you choose a method.
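As one example of explaining the why, here is a minimal sketch of median imputation in plain Python. The ages list is invented, and the 250 is a deliberate bad value to show why I'd often pick the median over the mean.

```python
import statistics

ages = [34, None, 29, 41, None, 38, 250]   # None marks a missing value; 250 is a data error

observed = [a for a in ages if a is not None]
missing_share = 1 - len(observed) / len(ages)

mean_fill = statistics.mean(observed)      # 78.4, dragged up by the 250
median_fill = statistics.median(observed)  # 38, barely affected by the 250

imputed = [median_fill if a is None else a for a in ages]
print(f"{missing_share:.0%} missing, mean={mean_fill}, median={median_fill}")
print(imputed)  # [34, 38, 29, 41, 38, 38, 250]
```

The interview-ready sentence here is: "I used the median because the column has extreme values that would distort the mean."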

14. Detecting Outliers

Methods:

IQR Method
Z-score
Visual inspection (boxplot)

IQR formula:

Lower Bound = Q1 - 1.5 * IQR
Upper Bound = Q3 + 1.5 * IQR
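The formula above can be sketched with Python's statistics module. The order values are invented, and the "inclusive" quartile method is one of several conventions, so other tools may give slightly different quartiles on the same data.

```python
import statistics

def iqr_bounds(values):
    """Return (lower, upper) outlier fences using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

order_values = [10, 12, 13, 14, 15, 16, 18, 19, 100]
lower, upper = iqr_bounds(order_values)
outliers = [v for v in order_values if v < lower or v > upper]

print(lower, upper, outliers)  # 5.5 25.5 [100]
```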

15. Scenario: “Sales Dropped Last Month”

Steps:

Check if data is correct
Compare month-over-month trends
Break down by:
- Region
- Product
- Customer segment
Check pricing or discount changes
Validate external factors

👉 Interviewers test thinking process, not just tools.

Final Advice for Capgemini L1

They look for:

Strong SQL basics
Clear explanation
Logical thinking
Structured approach
Confidence in fundamentals

Not advanced AI.
Not complex ML.

Just clarity.

5 Free Data Certifications You Can Earn This Week (No Money Needed)

Karthik Adari — Thu, 29 Jan 2026 04:37:45 GMT

Most people delay certifications because they assume it costs money.

Not true.

Here are 5 legit, industry-recognized credentials you can complete for $0 — and each one strengthens your profile for Data Analyst / Data Engineer / BI roles.

If you’re job hunting, pick 2 and finish them fast.
If you’re building a strong portfolio, do all 5 over the next month.

1) IBM SkillsBuild — Data Analytics (Free Digital Credentials)

✅ Best for: Beginners → Intermediate, structured learning + shareable credential
🎯 What you’ll learn: data analysis basics, data literacy, reporting mindset, real-world analytics workflows
📌 Why it helps: IBM credentials look strong on LinkedIn and help you show “I’m learning consistently.”

🔗 Link: https://skillsbuild.org/students/digital-credentials

Tip: Add this to LinkedIn as:
Licenses & Certifications → IBM SkillsBuild → Data Analytics (Credential)

2) Snowflake — Hands-On Essentials Track (Free Badges)

✅ Best for: Data Engineers / Analytics Engineers, modern warehouse skills
🎯 What you’ll learn: Snowflake concepts, warehouses/databases, loading data, querying, basics of performance
📌 Why it helps: Snowflake is widely used in analytics teams — this shows you can work with modern stacks.

🔗 Link: https://learn.snowflake.com/en/pages/hands-on-essentials-track/

Tip: Pair this with a small project:
“Load a CSV into Snowflake → run SQL queries → build a simple dashboard summary.”

3) HackerRank — SQL (Basic) Skills Certification Test

✅ Best for: Interview prep, proof of SQL fundamentals
🎯 What you’ll be tested on: SELECT, WHERE, joins basics, aggregations, grouping, simple subqueries
📌 Why it helps: Recruiters love quick proof. This is a clean “pass/fail credential” you can show fast.

🔗 Link: https://www.hackerrank.com/skills-verification/sql_basic

Tip: Do it after practicing 30–50 problems. Your pass badge becomes a strong signal.

4) Alteryx — Designer Core (Certification Exam Listing)

✅ Best for: Analytics + ETL automation, drag-and-drop workflows
🎯 What you’ll learn: data prep, joins, unions, transformations, workflow logic
📌 Why it helps: Many companies use Alteryx for BI automation. This stands out in analyst roles.

🔗 Link: https://community.alteryx.com/t5/Certification-Exams/bd-p/product-certification

Tip: If you’re targeting analyst roles, this can be a “differentiator” when others only list Excel.

5) MongoDB — Skill Badges (Free, Shareable)

✅ Best for: Data + Backend + NoSQL, modern document databases
🎯 What you’ll learn: querying with MongoDB, filtering, aggregations, schema design basics
📌 Why it helps: A lot of startups (and even big companies) use MongoDB. Knowing it makes you versatile.

🔗 Link: https://learn.mongodb.com/skills/

Tip: Add a simple project to GitHub:
“Store job postings → query by location/skills → build a basic analytics summary.”

Quick Plan (So You Actually Finish)

If your goal is Data Analyst:

Start with HackerRank SQL (Basic)
Then do IBM SkillsBuild
Add Snowflake as a bonus if you want modern tools

If your goal is Data Engineer:

Snowflake → MongoDB → HackerRank
Then Alteryx if your target jobs mention it


Accenture Data Analyst Interview: 15 Questions + Answers (With Short Explanations)

Karthik Adari — Wed, 28 Jan 2026 15:19:39 GMT

If you’re preparing, bookmark this and practice the same set.

1) INNER JOIN vs LEFT JOIN (real scenario)

✅ Answer

INNER JOIN returns only matching rows from both tables.
LEFT JOIN returns all rows from the left table + matches from the right table (unmatched becomes NULL).

Example (Customers + Orders)

-- Only customers who placed orders
SELECT c.customer_id, o.order_id
FROM customers c
INNER JOIN orders o
  ON c.customer_id = o.customer_id;

-- All customers (even if they placed no orders)
SELECT c.customer_id, o.order_id
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id;

When to use what

Use INNER JOIN when you only want “existing relationships.”
Use LEFT JOIN when you want a full list from left table (like all customers, all products, all employees).

2) WHERE vs HAVING (with example)

✅ Answer

WHERE filters rows before aggregation.
HAVING filters results after aggregation.

-- Filter rows before grouping (only completed orders)
SELECT customer_id, COUNT(*) AS total_orders
FROM orders
WHERE status = 'Completed'
GROUP BY customer_id
HAVING COUNT(*) >= 2;

Rule of thumb

Use WHERE for conditions on individual rows (plain column values).
Use HAVING for aggregates like COUNT, SUM, AVG.
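To watch both filters work in one query, here is a small sketch that runs the example above on an in-memory SQLite database. The order rows are invented: customer 2 has only one completed order, so the HAVING clause drops them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
      order_id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT
    );
    INSERT INTO orders (customer_id, status) VALUES
      (1, 'Completed'), (1, 'Completed'), (1, 'Cancelled'),
      (2, 'Completed'), (2, 'Cancelled'),
      (3, 'Completed'), (3, 'Completed');
""")

rows = conn.execute("""
    SELECT customer_id, COUNT(*) AS total_orders
    FROM orders
    WHERE status = 'Completed'      -- row filter: applied before grouping
    GROUP BY customer_id
    HAVING COUNT(*) >= 2            -- group filter: applied after aggregation
    ORDER BY customer_id
""").fetchall()

print(rows)  # [(1, 2), (3, 2)]
```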

3) SQL: 2nd Highest Salary (handle ties)

✅ Solution (best way: DENSE_RANK)

SELECT DISTINCT salary
FROM (
  SELECT salary,
         DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
  FROM employees
) s
WHERE rnk = 2;

Short explanation

DENSE_RANK() assigns same rank to ties.
The second highest distinct salary always has rank 2.

4) ROW_NUMBER vs RANK vs DENSE_RANK

✅ Answer

ROW_NUMBER() gives unique number (no ties).
RANK() gives same rank for ties but leaves gaps (1,1,3).
DENSE_RANK() gives same rank for ties without gaps (1,1,2).

SELECT employee_id, salary,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn,
       RANK() OVER (ORDER BY salary DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk
FROM employees;
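Running the three functions side by side on tied salaries makes the gap behaviour obvious. This sketch uses an in-memory SQLite database (3.25 or newer for window functions) with invented salaries; the extra employee_id tiebreaker in ROW_NUMBER is my addition so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, salary INTEGER);
    INSERT INTO employees VALUES (1, 90000), (2, 90000), (3, 80000), (4, 70000);
""")

rows = conn.execute("""
    SELECT employee_id, salary,
           ROW_NUMBER() OVER (ORDER BY salary DESC, employee_id) AS rn,
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk
    FROM employees
    ORDER BY rn
""").fetchall()

for row in rows:
    print(row)
# (1, 90000, 1, 1, 1)
# (2, 90000, 2, 1, 1)
# (3, 80000, 3, 3, 2)
# (4, 70000, 4, 4, 3)
```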

5) GROUP BY + HAVING (customers with ≥2 orders)

SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
HAVING COUNT(*) >= 2;

If revenue threshold is needed:

SELECT customer_id, SUM(amount) AS total_spend
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 1000;

Explanation: HAVING is used because we’re filtering aggregated values.

6) Joins create duplicate rows: Why? How to fix?

✅ Why duplicates happen

When the relationship isn’t 1-to-1, joins can multiply rows.

Example:

One customer has 3 orders
You join customers + orders → customer row appears 3 times

✅ Fix options

A) Use DISTINCT (quick fix, not always correct)

SELECT DISTINCT c.customer_id
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;

B) Aggregate before joining (best practice)

SELECT c.customer_id, o.total_orders
FROM customers c
LEFT JOIN (
  SELECT customer_id, COUNT(*) AS total_orders
  FROM orders
  GROUP BY customer_id
) o
ON c.customer_id = o.customer_id;
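Here is a small sketch of the row multiplication and the aggregate-first fix, run on an in-memory SQLite database with invented data. The COALESCE is my addition so that customers with no orders show 0 instead of NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders (customer_id) VALUES (1), (1), (1);
""")

# Plain join: customer 1 is repeated once per order.
raw = conn.execute("""
    SELECT c.customer_id
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
""").fetchall()

# Aggregate first: one row per customer, whatever the order count.
agg = conn.execute("""
    SELECT c.customer_id, COALESCE(o.total_orders, 0) AS total_orders
    FROM customers c
    LEFT JOIN (
      SELECT customer_id, COUNT(*) AS total_orders
      FROM orders
      GROUP BY customer_id
    ) o ON c.customer_id = o.customer_id
    ORDER BY c.customer_id
""").fetchall()

print(len(raw))  # 3
print(agg)       # [(1, 3), (2, 0)]
```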

7) COUNT(*) vs COUNT(column)

✅ Answer

COUNT(*) counts all rows, including rows where some columns are NULL.
COUNT(column) counts only non-NULL values in that column.

SELECT COUNT(*) AS total_rows,
       COUNT(email) AS non_null_emails
FROM users;

8) CTE vs Subquery (when to use each)

✅ Answer

Both can express the same logic. A CTE improves readability and can be referenced more than once in the same query.

Subquery

SELECT *
FROM (
  SELECT customer_id, SUM(amount) AS total_spend
  FROM orders
  GROUP BY customer_id
) t
WHERE total_spend > 1000;

CTE (Cleaner)

WITH spend AS (
  SELECT customer_id, SUM(amount) AS total_spend
  FROM orders
  GROUP BY customer_id
)
SELECT *
FROM spend
WHERE total_spend > 1000;

Use CTE when

logic is long
multiple steps are needed
you want clean debugging

9) Excel: VLOOKUP vs XLOOKUP vs INDEX-MATCH

✅ Answer

VLOOKUP: older, left-to-right only, breaks if columns move.
XLOOKUP: modern, flexible, supports left lookup, easy.
INDEX-MATCH: powerful and stable, works everywhere.

XLOOKUP example

=XLOOKUP(A2, Customers!A:A, Customers!C:C, "Not Found")

10) Excel: Pivot Table for KPIs

✅ Steps (short and practical)

Insert → Pivot Table
Put Category/Region in Rows
Put Sales/Revenue in Values (SUM)
Add Month in Columns for trend
Add Filters like Product, Channel

Why it’s asked: Shows you can summarize data fast for business.

11) Handling Missing Values + Outliers (IQR/Z-score/business)

✅ Missing values strategies

Remove if small % and random
Impute:
- numeric: mean/median
- categorical: mode
Or fill with “Unknown” for business clarity

✅ Outlier strategies

Confirm if it’s data error or true extreme
Handle using:
- IQR method
- Z-score
- capping/winsorizing
- log transform

Best answer tip: Always mention “business context decides.”

12) Power BI: Measures vs Calculated Columns

✅ Answer

Calculated Column: computed at refresh time, stored in model.
Measure: computed at query time, depends on filter context.

Example

Column: Profit = Sales[Amount] - Sales[Cost] (stored per row)
Measure: Total Profit = SUM(Sales[Amount]) - SUM(Sales[Cost]) (changes with slicers)

In interviews: prefer measures for aggregations and dashboards.

13) DAX: What does CALCULATE() do?

✅ Answer

CALCULATE() changes the filter context and then evaluates an expression.

Example: Sales for only “Online” channel

Online Sales =
CALCULATE(
  SUM(Sales[Amount]),
  Sales[Channel] = "Online"
)

Short explanation

It’s the most powerful DAX function
Used to apply filters dynamically

14) Make Power BI reports faster (huge data)

✅ Best optimization checklist

Reduce columns (remove unused fields)
Use Star Schema (fact + dimensions)
Avoid high-cardinality columns in visuals
Prefer measures over calculated columns
Use Aggregations and Incremental Refresh (if available)
Reduce visuals per page, avoid heavy custom visuals
Optimize DAX (avoid iterators when possible)

Interview punchline: Model first, DAX next, visuals last.

15) Case: “Sales dropped in one region” — How would you investigate?

✅ Strong structured approach (interview-ready)

Confirm the drop

Compare MoM, WoW, YoY
Check if it’s seasonality

Break down

Product, channel, customer segment
New vs returning customers

Check operations

Stockouts, delayed deliveries, pricing changes
Returns/refunds increased?

Check marketing

Campaign paused? CPC increased? traffic dropped?

Check data issues

ETL failure, missing transactions, time zone cutoffs

Share recommendation

Root cause + expected impact + next steps

Accenture loves: structured thinking + business storytelling.

#Accenture #DataAnalyst #SQL #PowerBI #Excel #DAX #InterviewPrep #DataAnalytics #BusinessAnalytics #JobSearch #Freshers #Analytics #CareerGrowth

5 Real Data Science Projects You Can Copy From GitHub (and a Complete Roadmap to Become Job-Ready)

Karthik Adari — Sun, 25 Jan 2026 04:33:26 GMT

Most people say “I know scikit-learn” or “I know MLflow.”

Hiring managers don’t hire that.

They hire this:
A working system that goes from data → model → evaluation → deployment → UI/insights.

Below are 5 real, end-to-end GitHub projects you can study, replicate, and then rebuild with your own twist. After that, I’m dropping a complete Data Scientist roadmap you can follow step-by-step.

Part 1: Top 5 Data Science Projects (End-to-End)

1) Customer Churn Prediction Pipeline (Production-style)

Repo: AWS Customer Churn Pipeline (GitHub)
Why it’s strong: It’s not just a notebook. It’s built like a real system with training + inference pipelines, validation, tuning, and explainability baked in. (GitHub)

What you learn

End-to-end ML pipeline thinking
Validation + feature processing at scale
Explainability inside production workflows

How to make it “yours”

Replace dataset with any SaaS churn dataset (or telecom churn)
Add a “top churn drivers” report for business users

Resume bullet example

Built an end-to-end churn prediction pipeline with automated training, validation, hyperparameter tuning, and explainability.

2) Insurance Cross-Sell (AutoML + MLflow + FastAPI + Streamlit)

Repo: End-to-End AutoML Insurance (GitHub)
Why it’s strong: This is the perfect “hireable” format: model tracking (MLflow), API (FastAPI), and an interface (Streamlit). (GitHub)

What you learn

Experiment tracking and model management
Serving predictions through an API
Building a simple app stakeholders can use

How to upgrade

Add model monitoring (drift + data checks)
Add a “confidence score” and threshold slider

Resume bullet example

Deployed an AutoML classification system using MLflow tracking, FastAPI inference service, and Streamlit UI for business users.

3) Credit Card Fraud Detection (FastAPI + Streamlit, batch scoring)

Repo: Fraud Detection System (GitHub)
Why it’s strong: Fraud is a real-world DS problem: imbalance, precision/recall tradeoffs, and operational workflows. This project is built like a deployable app. (GitHub)

What you learn

Handling imbalanced datasets (SMOTE, thresholds)
Building batch scoring pipelines
Delivering downloadable analysis reports

How to make it stand out

Add cost-based evaluation (false positive vs false negative cost)
Add a “review queue” dashboard for flagged transactions
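Cost-based evaluation can be sketched in a few lines. Everything here is a made-up illustration: the labels, the model scores, and especially the two cost figures, which in a real project should come from the business (what a manual review costs versus what a missed fraud costs).

```python
def total_cost(y_true, scores, threshold, fp_cost, fn_cost):
    """Sum the cost of false alarms and missed frauds at a given threshold."""
    cost = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label == 0:
            cost += fp_cost        # legitimate transaction sent to review
        elif not flagged and label == 1:
            cost += fn_cost        # fraud that slipped through
    return cost

# Hypothetical labels (1 = fraud) and model scores.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]
FP_COST, FN_COST = 5, 200          # assumed costs per error, in currency units

costs = {t: total_cost(y_true, scores, t, FP_COST, FN_COST)
         for t in (0.3, 0.5, 0.8)}
best_threshold = min(costs, key=costs.get)

print(costs)           # {0.3: 10, 0.5: 205, 0.8: 400}
print(best_threshold)  # 0.3
```

When a missed fraud costs far more than a false alarm, the cheapest threshold is usually lower than the default 0.5, and that is exactly the kind of trade-off worth writing up in the README.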

Resume bullet example

Built and deployed a fraud detection system with batch scoring, dynamic column mapping, and reporting via FastAPI + Streamlit.

4) House Price Prediction with ZenML + MLflow (Real MLOps flavor)

Repo: ZenML + MLflow House Price Pipeline (GitHub)
Why it’s strong: Shows reproducibility, pipelines, and CI/CD mindset, which is rare in typical DS portfolios. (GitHub)

What you learn

Pipeline orchestration for DS work
Experiment tracking + deployment flow
Production-grade project structure

Upgrade idea

Add feature store style transformations
Add automated retraining when data drifts

Resume bullet example

Implemented an end-to-end regression pipeline using ZenML for reproducible workflows and MLflow for tracking and deployment.

5) Recommender System as an App (FastAPI + Streamlit)

Repo: FastAPI Movie Recommender (GitHub)
Why it’s strong: Recommendations are common in DS interviews, and this includes an API + UI and even click tracking. (GitHub)

What you learn

Ranking/recommendation logic
Product analytics mindset (tracking interactions)
Packaging DS into an interactive system

Upgrade ideas

Add evaluation metrics (MAP@K, NDCG@K)
Add hybrid recommendations (content + collaborative)
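For the ranking metrics, here is a minimal NDCG@K sketch in plain Python. It uses the linear-gain variant (relevance divided by a log discount); some libraries use 2^relevance - 1 as the gain instead, so treat the exact numbers as convention-dependent. The relevance lists are invented.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the best possible ordering (0.0 if nothing is relevant)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of each recommended item, in the order the system ranked them.
perfect = [3, 2, 1, 0]   # best item first
swapped = [0, 2, 1, 3]   # best item buried at the bottom

perfect_score = ndcg_at_k(perfect, k=4)
swapped_score = ndcg_at_k(swapped, k=4)
print(perfect_score)              # 1.0
print(round(swapped_score, 3))    # lower, because the best item ranks last
```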

Resume bullet example

Built a recommendation system with FastAPI endpoints, Streamlit UI, and interaction tracking to measure engagement.

Part 2: Complete Data Scientist Roadmap (0 → Job-Ready)

Phase 0: Setup (1–2 days)

Python environment, Git/GitHub, Jupyter/VS Code
Basic Linux commands
README-writing habit (every project gets a clean README)

Phase 1: Foundations (2–3 weeks)

Python

Data types, functions, OOP basics
Writing clean code, modular scripts

Math + Stats essentials

Probability, distributions, Bayes basics
Mean/variance, sampling, CLT intuition
Confidence intervals, hypothesis testing basics

Checkpoint

Solve 30–50 small problems (Python + probability + statistics)

Phase 2: Data Skills (2–3 weeks)

SQL (non-negotiable)

Joins, window functions, CTEs
Aggregations, cohort queries

Data wrangling

pandas: joins, groupby, datetime, missing values
Data cleaning strategies and assumptions tracking

Visualization

matplotlib/plotly basics
telling a story with charts (not just plotting)

Checkpoint

Build a mini analytics report: raw CSV → cleaned dataset → SQL insights → dashboard chart pack

Phase 3: Core Machine Learning (4–6 weeks)

Supervised learning

Regression: linear, regularization, tree models
Classification: logistic regression, trees, boosting
Metrics: precision/recall, ROC-AUC, PR-AUC, F1

Workflow

Train/val/test split
Cross-validation
Feature engineering
Leakage detection
Hyperparameter tuning (grid/random)

Explainability

Feature importance, SHAP basics
Error analysis: where the model fails and why

Checkpoint

One full ML project: EDA → model → evaluation → explainability → final business recommendations

Phase 4: Specializations (pick 2, 3–6 weeks)

Pick based on your target roles.

Option A: NLP

TF-IDF → transformers
Text classification, embeddings, retrieval basics

Option B: Time Series

Baselines, backtesting, forecasting errors
Seasonality, trend, regressors

Option C: Recommenders

Collaborative filtering
Ranking metrics
Cold start strategies

Option D: Causal + Experimentation

A/B testing design
power and sample sizing (basic)
interpreting results for product decisions

Phase 5: Production and “Hireable” DS (3–6 weeks)

This is where you differentiate.

Build APIs (FastAPI)
Make a small UI (Streamlit)
Track experiments (MLflow)
Add data checks (basic validation)
Containerize (Docker)
Optional: deploy (Cloud Run / AWS / Render)

Checkpoint

2 deployable projects (with a live demo or clear run instructions)

Phase 6: Portfolio + Interview Readiness (ongoing)

Portfolio

3 strong projects max (quality > quantity)
Each project must show:
- Problem framing
- Metrics and why they matter
- Error analysis
- Business impact

Interview prep

SQL daily practice
ML concepts: bias/variance, leakage, metrics, regularization
Case studies: churn, fraud, forecasting, recommendations

The “Winning Portfolio” Strategy (Simple)

If you do only this, you’ll be in a strong spot:

Project 1: Churn (classification + explainability + business insights)
Project 2: Fraud (imbalance + thresholds + operational workflow)
Project 3: Recommender (ranking + evaluation + product analytics tracking)

All 3 should have: README, screenshots, clear setup, and a short “decision summary.”

TCS Data Analyst Interview Questions (With Solutions + Short Explanations)

Karthik Adari — Fri, 23 Jan 2026 15:58:04 GMT

1) INNER JOIN vs LEFT JOIN (SQL)

Concept

INNER JOIN → returns only matching rows in both tables
LEFT JOIN → returns all rows from left table + matching from right (non-matching becomes NULL)

Example

SELECT A.customer_id, B.order_id
FROM Customers A
LEFT JOIN Orders B
ON A.customer_id = B.customer_id;

When to use

INNER: only want customers who placed orders
LEFT: want all customers, even if no orders
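To see the difference on real rows, here is a minimal sketch using Python's built-in sqlite3 module. The table contents (three customers, one of whom never ordered) are made up for illustration.

```python
import sqlite3

# In-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO Customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO Orders VALUES (101, 1), (102, 1), (103, 2);
""")

# INNER JOIN: customer 3 has no orders, so it disappears.
inner_rows = conn.execute("""
    SELECT A.customer_id, B.order_id
    FROM Customers A
    INNER JOIN Orders B ON A.customer_id = B.customer_id
    ORDER BY A.customer_id, B.order_id
""").fetchall()

# LEFT JOIN: customer 3 is kept, with NULL (None in Python) for order_id.
left_rows = conn.execute("""
    SELECT A.customer_id, B.order_id
    FROM Customers A
    LEFT JOIN Orders B ON A.customer_id = B.customer_id
    ORDER BY A.customer_id, B.order_id
""").fetchall()

print(inner_rows)  # [(1, 101), (1, 102), (2, 103)]
print(left_rows)   # [(1, 101), (1, 102), (2, 103), (3, None)]
```

A handy follow-up in interviews: adding WHERE B.order_id IS NULL to the LEFT JOIN returns only the customers with no orders.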

2) WHERE vs HAVING (Real use case)

Concept

WHERE filters rows before grouping/aggregation
HAVING filters groups after aggregation

Example

SELECT dept, COUNT(*)
FROM Employees
GROUP BY dept
HAVING COUNT(*) > 5;

Real use case

WHERE: filter only 2025 sales rows first
HAVING: filter only regions where total sales > 1M
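The real use case above combines both clauses in one query. A small sketch with sqlite3, assuming a hypothetical sales table with region, sale_date, and amount columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, sale_date TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('North', '2025-03-01',  700000),
        ('North', '2025-06-01',  500000),
        ('North', '2024-12-31',  900000),
        ('South', '2025-02-01',  400000),
        ('South', '2024-05-01', 2000000);
""")

big_regions_2025 = conn.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    WHERE sale_date >= '2025-01-01' AND sale_date < '2026-01-01'  -- row filter, runs first
    GROUP BY region
    HAVING SUM(amount) > 1000000                                  -- group filter, runs after
    ORDER BY region
""").fetchall()

print(big_regions_2025)  # [('North', 1200000)]
```

South's large 2024 sale is removed by WHERE before grouping, so its 2025 total (400,000) fails the HAVING check.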

3) SQL: Find the 2nd highest salary

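Simple approach (subquery)

SELECT MAX(salary) AS Second_Highest
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

Short note

This returns the highest salary that is strictly lower than the maximum, so several employees sharing the top salary do not change the answer. It returns NULL when there is only one distinct salary.

It does not generalize easily to the Nth highest salary. For that, rank the distinct salaries with DENSE_RANK() and filter on the rank.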
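A DENSE_RANK version could look like the sketch below. It runs through sqlite3 and assumes SQLite 3.25 or newer (window function support); the helper name nth_highest_salary and the sample salaries are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('A', 90000), ('B', 90000), ('C', 75000), ('D', 75000), ('E', 60000);
""")

def nth_highest_salary(conn, n):
    """Return the nth highest distinct salary, or None if it does not exist."""
    row = conn.execute("""
        SELECT DISTINCT salary
        FROM (
            SELECT salary,
                   DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
            FROM employees
        ) AS ranked
        WHERE rnk = ?
    """, (n,)).fetchone()
    return row[0] if row else None

second = nth_highest_salary(conn, 2)
fourth = nth_highest_salary(conn, 4)
print(second)  # 75000
print(fourth)  # None (only three distinct salaries)
```

DENSE_RANK gives tied salaries the same rank without leaving gaps, which is why rank 2 is 75000 even though two people earn 90000.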

4) How do you handle missing values?

Common options

Drop missing rows/columns (if very small % and not important)
Impute using mean/median/mode
Predict missing values using regression / KNN imputer

Rule of thumb

If missing < ~5% → dropping can be OK
If the column is important → impute or model it
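The rule of thumb can be sketched in plain Python. The function name handle_missing and the 5% default are illustrative assumptions, not a fixed standard, and median imputation is just one of the options listed above.

```python
from statistics import median

def handle_missing(values, drop_threshold=0.05):
    """Drop missing values if they are rare, otherwise impute with the median."""
    missing_share = sum(v is None for v in values) / len(values)
    observed = [v for v in values if v is not None]
    if missing_share < drop_threshold:
        return observed, missing_share, "dropped"
    fill_value = median(observed)
    imputed = [fill_value if v is None else v for v in values]
    return imputed, missing_share, "imputed"

# Hypothetical column: 2 of 10 ages are missing (20%).
ages = [34, None, 29, 41, None, 38, 30, 45, 27, 33]
cleaned, share, action = handle_missing(ages)

print(share, action)  # 0.2 imputed
print(cleaned)        # missing entries replaced by the median, 33.5
```

In a real project you would likely do the same with pandas (isna, fillna, dropna), and check whether the values are missing at random before choosing.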

5) How do you detect outliers? (IQR / Z-score / boxplot)

IQR Method (most common)

Outliers are below Q1 − 1.5×IQR or above Q3 + 1.5×IQR

Z-score Method

If |z| > 3, treat as outlier (common threshold)

Visual checks

Boxplot and scatter plot for quick spotting
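Both numeric rules fit in a few lines of standard-library Python. This is a sketch on made-up data; note that statistics.quantiles uses its default "exclusive" method here, and other tools (Excel, pandas, NumPy) may compute quartiles slightly differently.

```python
from statistics import mean, pstdev, quantiles

# Hypothetical daily order counts with one suspicious value.
data = [10, 11, 12, 12, 13, 13, 14, 15, 11, 12,
        13, 14, 12, 13, 11, 14, 12, 13, 12, 102]

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < lower or x > upper]

# Z-score method: flag values more than 3 standard deviations from the mean.
mu, sigma = mean(data), pstdev(data)
z_outliers = [x for x in data if abs((x - mu) / sigma) > 3]

print(q1, q3, lower, upper)  # 12.0 13.75 9.375 16.375
print(iqr_outliers)          # [102]
print(z_outliers)            # [102]
```

One caveat worth mentioning in an interview: the Z-score rule is weak on small samples, because a single extreme value inflates the mean and standard deviation it is measured against. The IQR rule is more robust for that reason.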

6) Normalization (1NF, 2NF, 3NF)

Goal: reduce redundancy and avoid update anomalies.

1NF: atomic values (no lists inside a cell)
2NF: 1NF plus no partial dependency (every non-key column depends on the whole composite key, not just part of it)
3NF: 2NF plus no transitive dependency (a non-key column should not depend on another non-key column)

Quick memory trick:
1NF = clean cells
2NF = full key dependency
3NF = no indirect dependency
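Here is a small sketch of the "avoid update anomalies" goal, using sqlite3 and hypothetical table names. In the flat table, customer_city depends on customer_id rather than on the order, so a city change has to touch every order row. After the split, it touches one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Flat table: customer_city depends on customer_id, not on order_id.
    CREATE TABLE orders_flat (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        customer_city TEXT,
        amount REAL
    );
    INSERT INTO orders_flat VALUES
        (1, 10, 'Pune', 250.0), (2, 10, 'Pune', 120.0), (3, 11, 'Delhi', 90.0);

    -- Normalized split: customer attributes live in one place.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount REAL
    );
    INSERT INTO customers SELECT DISTINCT customer_id, customer_city FROM orders_flat;
    INSERT INTO orders SELECT order_id, customer_id, amount FROM orders_flat;
""")

# Customer 10 moves to Mumbai: count how many rows each design has to update.
flat_updates = conn.execute(
    "UPDATE orders_flat SET customer_city = 'Mumbai' WHERE customer_id = 10"
).rowcount
norm_updates = conn.execute(
    "UPDATE customers SET city = 'Mumbai' WHERE customer_id = 10"
).rowcount

print(flat_updates, norm_updates)  # 2 1
```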

7) OLTP vs OLAP (with examples)

OLTP (Transactional systems)

Fast inserts/updates
Highly normalized
Example: ATM, e-commerce checkout

OLAP (Analytics systems)

Fast reads + aggregations
Often denormalized (star schema)
Example: dashboards, reporting systems

8) What is data cleaning + checklist?

Definition
Data cleaning = making data accurate, consistent, and analysis-ready.

My checklist

Remove duplicates
Decide on a missing-values strategy (drop, impute, or flag)
Standardize formats (dates, currencies, categories)
Handle outliers (remove/cap/transform)
Validate ranges (age > 0, salary not negative)
Check consistency across columns (state vs zip, etc.)
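A few of these checklist items (duplicates, date formats, range validation) can be sketched with the standard library alone. The records, the two accepted date formats, and the helper name parse_date are assumptions for illustration.

```python
from datetime import datetime

# Hypothetical raw records with typical problems.
raw = [
    {"id": 1, "signup": "2025-01-15", "age": 29},
    {"id": 1, "signup": "2025-01-15", "age": 29},   # exact duplicate
    {"id": 2, "signup": "15/02/2025", "age": 41},   # different date format
    {"id": 3, "signup": "2025-03-02", "age": -5},   # fails the range check
]

def parse_date(text):
    """Standardize a date string to ISO format, or return None if unparseable."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

seen, clean, rejected = set(), [], []
for row in raw:
    key = tuple(sorted(row.items()))
    if key in seen:                      # remove duplicates
        continue
    seen.add(key)
    row = {**row, "signup": parse_date(row["signup"])}  # standardize formats
    if row["signup"] is None or row["age"] <= 0:        # validate ranges
        rejected.append(row)
    else:
        clean.append(row)

print(clean)     # ids 1 and 2, both with ISO dates
print(rejected)  # id 3 (negative age)
```

Keeping the rejected rows instead of silently deleting them makes it easier to explain what was removed and why.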

9) Power BI: Measures vs Columns

Column

Calculated per row (stored in the table)
Good for row-level logic or categories

Measure

Aggregated result evaluated on the fly (changes with filters/slicers)
Best for KPIs like Sales, Profit, YoY%

Shortcut:
Columns = row-wise
Measures = filter-context dependent

10) DAX: What does CALCULATE() do?

Concept
CALCULATE() evaluates an expression (usually a measure) under a modified filter context.

Example

Total2023Sales =
CALCULATE(SUM(Sales[Amount]), Sales[Year] = 2023)

In simple terms:
It tells Power BI: “Compute this, but under these filters.”

11) Make Power BI reports faster for huge data

High-impact optimizations:

Remove unused columns
Use Star schema
Prefer DAX measures over calculated columns, and push heavy transformations upstream (Power Query or the source)
Aggregate at source/query level
Turn off Auto date/time (often helps)
Reduce visuals on a page (too many visuals slow rendering)

12) Scenario: “Sales dropped in one region” — how to investigate?

A clean interview flow:

Compare MoM / YoY trend for that region
Break down by product, category, channel, customer segment
Check pricing changes, discounts, stock-outs, returns
Look for customer churn or loss of key accounts
Validate external factors: holidays, competition, supply chain issues

Bonus line:
“I’d validate whether it’s a data issue first (missing transactions, wrong filters, refresh failures).”
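The first steps of that flow can be sketched as two queries: a row-count sanity check, then a month-over-month breakdown by product. The sales table, its columns, and the numbers are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, month TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('West', 'A', '2025-05', 100), ('West', 'A', '2025-06', 95),
        ('West', 'B', '2025-05', 200), ('West', 'B', '2025-06', 120),
        ('East', 'A', '2025-05', 150), ('East', 'A', '2025-06', 155);
""")

# Step 0: data sanity check. A sudden fall in row counts hints at a load issue.
row_counts = conn.execute("""
    SELECT month, COUNT(*) FROM sales
    WHERE region = 'West'
    GROUP BY month ORDER BY month
""").fetchall()

# Step 1: previous vs current month for the region, broken down by product.
breakdown = conn.execute("""
    SELECT product,
           SUM(CASE WHEN month = ? THEN amount ELSE 0 END) AS prev_sales,
           SUM(CASE WHEN month = ? THEN amount ELSE 0 END) AS curr_sales
    FROM sales
    WHERE region = ?
    GROUP BY product
    ORDER BY product
""", ("2025-05", "2025-06", "West")).fetchall()

# Rank products by absolute change to find where the drop is concentrated.
changes = sorted(
    ((product, curr - prev) for product, prev, curr in breakdown),
    key=lambda item: item[1],
)
biggest_drop = changes[0]

print(row_counts)    # [('2025-05', 2), ('2025-06', 2)]
print(biggest_drop)  # ('B', -80.0)
```

The same pattern repeats for category, channel, and customer segment: swap the GROUP BY column and compare again.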

13) GROUP BY + common mistakes

Purpose
Groups rows that share the same values in the listed columns and summarizes each group.

Example

SELECT department, COUNT(*)
FROM employees
GROUP BY department;

Common mistakes

Selecting a non-aggregated column not present in GROUP BY
Using WHERE instead of HAVING for aggregate filters
Grouping at wrong granularity (monthly vs daily mismatch)

14) COUNT(*) vs COUNT(column)

COUNT(*) → counts all rows, including rows where some columns are NULL
COUNT(column) → counts only rows where that column is NOT NULL

Interview-safe example:
“If some salaries are NULL, COUNT(salary) will be lower than COUNT(*)”
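A quick way to prove it, sketched with sqlite3 on a hypothetical four-row table where one salary is NULL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('A', 50000), ('B', NULL), ('C', 65000), ('D', 50000);
""")

all_rows, with_salary = conn.execute(
    "SELECT COUNT(*), COUNT(salary) FROM employees"
).fetchone()

print(all_rows, with_salary)  # 4 3
```

The gap between the two counts is also a fast way to measure how many values are missing in a column.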

15) Tell me about a time you used data to drive a decision (project answer)

Use this structure:
Context → Action → Insight → Impact (with a number)

Sample
“I built an e-commerce sales dashboard and analyzed product-wise revenue, profit margins, and region performance. I found returns were unusually high in one region. After drilling down, it pointed to a logistics issue. We adjusted the delivery partner for that region and reduced the return rate by ~18%.”

Tip:
Always add one metric (18%, 2x, 10 hours saved, etc.)


The Ultimate Cold Outreach Template Guide

Karthik Adari — Wed, 21 Jan 2026 15:58:43 GMT

The Ultimate Cold Outreach Template Guide (v2.0)

Optimized for LinkedIn & Email

Phase 1: The Connection Request (300 Character Limit)

Strategy: No links. No pitch. Just a specific hook.

Option A: The “Fan” (Best for Hiring Managers)

Hi [Name], I recently saw your post about [Specific Topic] - the point about [Detail] really stood out to me. I’m currently building in this space and would love to connect to follow your updates.

Option B: The “Fellow Professional” (Best for Peers)

Hi [Name], I found your profile while researching [Company]. I see we both work in [Industry/Domain]. I’d love to connect to share insights in the field.

Option C: The “Alumni/Mutual” (Best for Warm Leads)

Hi [Name], I noticed we are both alumni of [University/Company]. I’m currently working in [Industry] and would love to connect with a fellow [Mascot/Alumni Name] in the space.

Phase 2: The Hiring Manager (The “Value Pitch”)

Strategy: Hyper-specific opening + Low friction CTA. Avoid attachments on LinkedIn initially.

Subject: Question regarding [Role Title] / [Specific Project]

Message:

Hi [Name],
I saw the [Role Title] opening and your team’s recent work on [Specific Initiative/News]. This role’s focus on [JD Theme, e.g., Scaling Systems] aligns perfectly with what I’ve delivered in my past work.
Quick highlights of my fit:

Relevance: [Number] years focused on [Specific Domain].
Impact: Built [Project] which resulted in [Metric/Result, e.g., 20% efficiency increase].
Skillset: Strong technical command of [Key Tool A] and [Key Tool B].

I know you are busy. Instead of a meeting, would you be open to a 2-minute overview of how I could support the current priorities?
Best,
[Your Name]
[Portfolio Link - Only include if sending via Email]

Phase 3: The Recruiter (The “Screening Checklist”)

Strategy: Make their job easy. Give them the data they need to “pass” you immediately.

Subject: Application for [Role Title] (ID: [Job ID]) - [Your Name]

Message:

Hi [Name],
I’m writing to express strong interest in the [Role Title] role (Job ID: [Number]). Based on the requirements for [Skill A] and [Skill B], I believe I am a strong technical match.
The Logistics (To save you time):

Location: [Your City] (Open to relocation/Remote)
Authorization: [Citizen / Green Card / Visa Status]
Notice Period: [2 Weeks / Immediate]
Key Skills: [Skill 1], [Skill 2], [Skill 3]

I’ve attached my resume for review. I’d love to connect if my background aligns with what you are looking for.
Best,
[Your Name]

Phase 4: The Peer Referral (The “Soft Ask”)

Strategy: Low-effort questions. If they reply, offer to write the referral blurb for them.

Step 1: The Initial Outreach

Subject: Quick question about [Company/Team]

Hi [Name],
I came across your profile while researching [Company]. I’ve built [Project 1] related to this domain, so I have always admired the team’s approach to [Topic].
I know you’re busy, but I’d love your quick take on one thing:

Is the team currently more focused on [Strategy A] or [Strategy B]?

No pressure at all - thanks for sharing your work!
Best,
[Your Name]

Step 2: The “Ask” (Send ONLY after they reply)

Thanks, [Name]. That insight is really helpful.
I actually noticed the [Job Title] role opened up on your team. Since I have a background in [Your Skill], I feel I’d be a great fit.
Would you be open to referring me? If yes, I can send over the job link and 3 bullet points about my experience so you don’t have to write anything.

Phase 5: The Follow-Up (The “Quick Bump”)

Strategy: Short, direct, and zero guilt.

Subject: Re: [Previous Subject Line]

Hi [Name],
Quick bump on this - happy to send a 2-minute overview via email/message instead of scheduling time if that’s easier.
Let me know if the role is still a priority.
Best,
[Your Name]

Key Strategy Notes

The “Quick Highlights” Section: Bullet points are essential. Hiring managers scan emails; they do not read them word-for-word. This section allows them to see your value in 3 seconds.
No Attachments on LinkedIn: LinkedIn compresses images and sometimes flags PDFs as security risks. In DMs, say: “Happy to share my resume if useful” and wait for them to say yes. For Email, attaching is fine.
The “Easy Referral”: Never make a current employee work for you. Always offer to write the “blurb” (the 3 bullet points) they can simply copy-paste into their internal referral system.
The “2-Minute Overview”: Asking for “30 minutes” feels like a burden (a meeting). Asking to “share a 2-minute overview” feels like you are being helpful and respectful of their time.

3 Golden Rules for Cold Outreach

1. Mobile Optimization is Non-Negotiable

Most recruiters and managers read LinkedIn messages on their phones. If your message looks like a “wall of text” (more than 3 sentences without a break), they will skip it. Use short paragraphs, bullet points, and bold text to guide their eyes.

2. Don’t be Assumptive

Avoid phrases like “I can solve your problems” (you don’t know their problems yet). Use softer language: “I believe I can help with the priorities this role supports” or “This aligns with my experience in [Area].”

3. Low-Friction Calls to Action (CTA)

End every message with a question that allows them to say “Yes” easily.

Bad: “Can I have 30 minutes to pick your brain?” (High effort).
Good: “Are you open to a connection?” or “Can I send a 2-minute overview?” (Low effort).