Master Product Data Analytics

Your Guide To Data Analytics Mastery

2. SQL & Data Manipulation (Extracting and Transforming Data)

SQL (Structured Query Language) is the standard language for interacting with relational databases.

SQL is used extensively to extract, transform, and manipulate data. This section will cover the core SQL concepts you need to know for the Meta interview and beyond.

We'll start with the fundamentals and then move on to more advanced techniques. 💪

For Anyone getting started with the Video below, is a great stating point!

2.1 Core SQL Syntax

Let's start with the building blocks of SQL. These are the fundamental commands you'll use in almost every query you write.


  • 2.1.1 SELECT, FROM, WHERE (Filtering Data)

    These are the most basic and essential SQL keywords. They form the foundation of most queries.

    • SELECT: Specifies the columns you want to retrieve from a table.
    • FROM: Specifies the table you want to retrieve data from.
    • WHERE: Filters the data based on a specified condition.

    Example:

                                    
                                        SELECT user_id, name, email
                                        FROM users
                                        WHERE country = 'USA';
                                    
                                

    This query retrieves the `user_id`, `name`, and `email` columns from the `users` table, but only for rows where the `country` column is equal to 'USA'.

    References:

  • 2.1.2 JOINs (INNER, LEFT, RIGHT, FULL OUTER - Combining Tables)

    JOINs are used to combine data from two or more tables based on a related column between them.

    • INNER JOIN: Returns rows when there is a match in both tables.
    • LEFT (OUTER) JOIN: Returns all rows from the left table, and the matched rows from the right table. If there is no match, it returns NULL for the columns from the right table.
    • RIGHT (OUTER) JOIN: Returns all rows from the right table, and the matched rows from the left table. If there is no match, it returns NULL for the columns from the left table.
    • FULL (OUTER) JOIN: Returns all rows when there is a match in either the left or right table. If there is no match, it returns NULL for the columns from the non-matching table.

    Example:

                                    
                                        SELECT orders.order_id, users.name
                                        FROM orders
                                        INNER JOIN users ON orders.user_id = users.user_id;
                                    
                                

    This query returns the `order_id` from the `orders` table and the `name` from the `users` table, but only for rows where the `user_id` is the same in both tables (i.e., only for orders that are associated with a user).

    References:

  • 2.1.3 GROUP BY and Aggregate Functions (COUNT, SUM, AVG, MIN, MAX)

    The `GROUP BY` statement groups rows that have the same values in one or more columns into a summary row. Aggregate functions are often used with `GROUP BY` to perform calculations on each group.

    • COUNT: Counts the number of rows in each group.
    • SUM: Calculates the sum of a numeric column in each group.
    • AVG: Calculates the average of a numeric column in each group.
    • MIN: Returns the minimum value of a column in each group.
    • MAX: Returns the maximum value of a column in each group.

    Example:

                                    
                                        SELECT country, COUNT(*) AS num_users
                                        FROM users
                                        GROUP BY country;
                                    
                                

    This query groups the rows in the `users` table by the `country` column and then counts the number of users in each country.

    References:

  • 2.1.4 ORDER BY (Sorting Data)

    The `ORDER BY` clause sorts the result set in ascending or descending order based on one or more columns.

    Example:

                                    
                                        SELECT user_id, name, registration_date
                                        FROM users
                                        ORDER BY registration_date DESC;
                                    
                                

    This query retrieves the `user_id`, `name`, and `registration_date` columns from the `users` table and sorts the results in descending order of `registration_date`.

    References:

  • 2.1.5 HAVING (Filtering Aggregated Data)

    The `HAVING` clause is used to filter the results of a `GROUP BY` query based on aggregate functions. It's similar to the `WHERE` clause, but it operates on grouped data rather than individual rows.

    Example:

                                    
                                        SELECT country, COUNT(*) AS num_users
                                        FROM users
                                        GROUP BY country
                                        HAVING COUNT(*) > 100;
                                    
                                

    This query groups the rows in the `users` table by `country`, counts the number of users in each country, and then filters the results to include only countries with more than 100 users.

    References:


2.2 Advanced SQL Techniques

Now that you've mastered the basics, let's move on to some more advanced SQL techniques that will enable you to perform complex data manipulations and analyses.


  • 2.2.1 Window Functions (ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, NTILE)

    Window functions perform calculations across a set of rows that are related to the current row. They are similar to aggregate functions, but instead of returning a single value for each group, they return a value for each row.

    • ROW_NUMBER: Assigns a unique sequential integer to each row within the partition.
    • RANK: Assigns a rank to each row within the partition, with the same rank for equal values. Gaps may appear in the sequence.
    • DENSE_RANK: Similar to RANK, but without gaps in the ranking sequence.
    • LAG: Accesses data from a previous row in the result set.
    • LEAD: Accesses data from a following row in the result set.
    • NTILE: Divides the rows within a partition into a specified number of groups (e.g., quartiles, deciles).

    Example:

                                    
                                        SELECT
                                            order_id,
                                            order_date,
                                            amount,
                                            RANK() OVER (PARTITION BY customer_id ORDER BY order_date) as order_rank
                                        FROM orders;
                                    
                                

    This query assigns a rank to each order within each customer's set of orders, based on the order date.

    References:

  • 2.2.2 Subqueries and CTEs (Common Table Expressions)

    Subqueries are queries nested within another query. They can be used to perform operations in multiple steps or to filter data based on the results of another query.

    CTEs (Common Table Expressions) are temporary result sets that you can reference within a `SELECT`, `INSERT`, `UPDATE`, or `DELETE` statement. They are defined using the `WITH` clause and are useful for breaking down complex queries into smaller, more manageable parts.

    Example (Subquery):

                                    
                                        SELECT order_id, amount
                                        FROM orders
                                        WHERE customer_id IN (SELECT customer_id FROM customers WHERE country = 'USA');
                                    
                                

    This query uses a subquery in the `WHERE` clause to select only orders from customers in the USA.

    Example (CTE):

                                    
                                        WITH USACustomers AS (
                                            SELECT customer_id
                                            FROM customers
                                            WHERE country = 'USA'
                                        )
                                        SELECT order_id, amount
                                        FROM orders
                                        WHERE customer_id IN (SELECT customer_id FROM USACustomers);
                                    
                                

    This query uses a CTE called `USACustomers` to define a temporary result set of customers from the USA, and then uses that CTE in the main query to select orders from those customers.

    References:

  • 2.2.3 String Manipulation Functions

    SQL provides a variety of functions for manipulating strings, such as:

    • CONCAT: Concatenates two or more strings.
    • SUBSTRING: Extracts a substring from a string.
    • LENGTH: Returns the length of a string.
    • UPPER/LOWER: Converts a string to uppercase or lowercase.
    • REPLACE: Replaces occurrences of a substring within a string.
    • TRIM: Removes leading and/or trailing spaces from a string.

    Example:

                                    
                                        SELECT CONCAT(first_name, ' ', last_name) AS full_name
                                        FROM users;
                                    
                                

    References:

  • 2.2.4 Date and Time Functions

    SQL provides functions for working with dates and times, such as:

    • NOW/CURRENT_DATE/CURRENT_TIMESTAMP: Returns the current date and/or time.
    • DATE_PART/EXTRACT: Extracts a specific part of a date or time (e.g., year, month, day).
    • DATE_ADD/DATE_SUB: Adds or subtracts a time interval from a date.
    • DATEDIFF: Calculates the difference between two dates.

    Example:

                                    
                                        SELECT order_id, order_date
                                        FROM orders
                                        WHERE DATE_PART('year', order_date) = 2023;
                                    
                                

    References:


2.3 Query Optimization

Writing efficient SQL queries is crucial for working with large datasets. Here are some techniques for optimizing your queries:


  • 2.3.1 Understanding Execution Plans

    Most database systems provide a way to view the execution plan of a query, which shows how the database will execute the query. This can help you identify bottlenecks and areas for improvement.

    Example (PostgreSQL):

                                    
                                        EXPLAIN SELECT * FROM users WHERE country = 'USA';
                                    
                                

    References:

  • 2.3.2 Indexing Strategies

    Indexes are data structures that improve the speed of data retrieval operations on a database table. Creating appropriate indexes on frequently queried columns can significantly speed up queries.

    Example:

                                    
                                        CREATE INDEX idx_country ON users (country);
                                    
                                

    References:

  • 2.3.3 Efficient Joins and Filtering

    Here are some tips for writing efficient joins and filters:

    • Use `INNER JOIN` instead of `WHERE` clause to join tables when possible.
    • Filter data as early as possible in the query using `WHERE` and `HAVING` clauses.
    • Avoid using `OR` conditions in `JOIN` clauses, as they can lead to poor performance.
    • Use `EXISTS` instead of `COUNT(*)` to check for the existence of rows.

2.4 Data Cleaning with SQL

Real-world data is often messy and inconsistent. SQL can be used to clean and transform data before analysis. Here are some common data cleaning tasks:


  • 2.4.1 Handling Missing Values (NULLs)

    SQL provides functions for dealing with NULL values:

    • COALESCE: Returns the first non-NULL expression in a list.
    • IS NULL / IS NOT NULL: Checks if a value is NULL or not.
    • NULLIF: Returns NULL if two expressions are equal, otherwise returns the first expression.

    Example:

                                    
                                        SELECT COALESCE(email, 'N/A') as email
                                        FROM users;
                                    
                                

    W3Schools: SQL NULL Values

  • 2.4.2 Data Type Conversions

    You may need to convert data from one type to another (e.g., string to integer, date to string). SQL provides functions like `CAST` and `CONVERT` for this purpose.

    Example:

                                    
                                        SELECT CAST(order_date AS DATE)
                                        FROM orders;
                                    
                                

    W3Schools: SQL CAST Function (SQL Server syntax, but similar functions exist in other database systems)

  • 2.4.3 Identifying and Removing Duplicates

    Duplicate rows can skew your analysis. You can use `DISTINCT` or `GROUP BY` to identify and remove duplicates.

    Example:

                                    
                                        SELECT DISTINCT user_id
                                        FROM users;
                                    
                                

    W3Schools: SQL DISTINCT Statement