Lesson 1

Basic SQL

Before we begin our discussion of SQL, I will discuss how Codd's relational model influenced the creation of SQL. Codd's relational model had a profound influence on the creation of SQL. While they are not identical, the core principles of the relational model were instrumental in shaping the development of SQL. Here's a breakdown:
Codd's Relational Model:

Introduced in 1970 by Edgar F. Codd as a theoretical framework for organizing and managing data.
Defined key concepts like relations (tables), attributes (columns), keys, and constraints.
Emphasized a declarative query language for accessing and manipulating data.

SQL (Structured Query Language):

Developed in the early 1970s at IBM specifically for the System R, a relational database management system.
Originally called SEQUEL (Structured English Query Language).
Heavily influenced by Codd's model, incorporating most of its foundational principles.

The connection:

While Codd himself didn't directly participate in creating SQL, the IBM team responsible was familiar with his work and drew inspiration from it.
SEQUEL, the precursor to SQL, aimed to embody the expressiveness and ease of use envisioned in Codd's model.
Although SQL ultimately deviated from the pure relational model in some aspects, it still reflects Codd's core ideas and serves as a powerful tool for interacting with relational databases.

Important to note:

Codd was not entirely happy with how SQL deviated from his model and advocated for stricter adherence to its principles.
Despite his critiques, SQL's widespread adoption and success cemented its role as the standard language for relational databases.

Therefore, while not an exact replica, SQL heavily relied on and was significantly influenced by Codd's relational model. The SQL programming language was first developed in the 1970s by IBM researchers Raymond Boyce and Donald Chamberlin. The programming language, known then as SEQUEL, was created following the publishing of Edgar Todd's paper, "A Relational Model of Data for Large Shared Data Banks," in 1970. SQL is used to communicate with the table of a database using CRUD^[1] operations. According to ANSI (American National Standards Institute), it is the standard language for relational database management systems. SQL statements are used to perform tasks such as update data on a database, or retrieve data from a database.

What is SQL?

The Structured Query Language is one of the most powerful tools available today when it comes to working with data sets and getting the information you need from databases. This course will introduce you to SQL and explain how you can quickly start using it to query databases and how you can start to fine-tune your use of SQL. Along with Codd's definition of the relational model, he proposed a language called Alpha^[2] for manipulating the data in relational tables. Shortly after Codd's paper was released, IBM commissioned a group to build a prototype based on the ideas of Codd. This group created a simplified version of DSL/Alpha that they called SQUARE and refinements to SQUARE led to a language called SEQUEL, which was renamed SQL. SQL is now entering middle age and it has undergone a great deal of change along the way. In the mid-1980s, the American National Standards Institute (ANSI) began working on the first standard for the SQL language, which was published in 1986. Subsequent refinements led to new releases of the SQL standard in 1989, 1992, 1999, 2003, 2006 and later. Along with refinements to the core language, new features have been added to the SQL language to incorporate object-oriented functionality. The latest standard, SQL:2011, focuses on the integration of SQL and XML and defines a language called XQuery which is used to query data in XML documents. SQL goes hand in hand with the "relational model" because the result of an SQL query is a table or result set. Thus, a new permanent table can be created in a relational database simply by storing the result set of a query. Similarly, a query can use both permanent tables and the result sets from other queries as inputs. When referring to the language, it is equally acceptable to say the letters individually (i.e., S. Q. L.) or to use the word sequel.

ANSI version of SQL 2023 differentiate itself from the previous versions

Temporal Tables: Provides explicit support for managing time-varying data with columns that track changes over time.
JSON Functions and Operators: Adds a comprehensive set of functions and operators for working with JSON data, enabling easier extraction, manipulation, and storage.
Window Functions Enhancements: Extends the capabilities of window functions with new functions, including RANK() and PERCENTILE_CONT().
Common Table Expressions (CTEs) with LATERAL: Introduces LATERAL, which allows CTEs to reference data from subqueries, providing greater flexibility in data manipulation.
Improved Subquery Performance: Optimizes the performance of subqueries by leveraging table correlation and cardinality estimation.
Inline Table-Valued Function (TVF) Enhancements: Strengthens inline TVFs by allowing them to access outer query scope and return multiple result sets.
New Data Types: Extends data type support with UUID, GEOGRAPHY, and GEOMETRY types for handling unique identifiers and geospatial data.
Extended Stored Procedures: Enhances stored procedures with declarative support for error handling, transaction management, and cursor handling.

Key Differentiators from Previous Versions:

Temporal Support: Temporal Tables provide a more advanced and structured mechanism for managing temporal data compared to legacy timestamp columns.
JSON Integration: The addition of native JSON support streamlines the processing and manipulation of JSON data.
Enhanced Window Functions: The extended capabilities of window functions allow for more complex aggregation and analytical operations.
CTE with LATERAL: LATERAL enhances CTEs by enabling the creation of more efficient and reusable subqueries.
Optimized Subqueries: The performance improvements in subqueries result in faster query execution and improved scalability.
Advanced Stored Procedures: The declarative approach to error handling and transaction management in stored procedures simplifies programming and enhances code maintainability.

Role of the Structured Query Language in Big Data Analytics and Data Science

Structured Query Language (SQL) plays a pivotal role in the realms of Big Data Analytics and Data Science, serving as the lingua franca for interacting with relational databases and managing vast datasets. Its utility in these fields is underscored by its versatility, efficiency, and widespread adoption across various data management systems.

Big Data Analytics: In the context of Big Data Analytics, SQL is instrumental in facilitating the extraction, transformation, and loading (ETL) of massive volumes of data, a foundational step in the big data processing pipeline. SQL's powerful querying capabilities enable analysts to perform complex data manipulations and aggregations, essential for deriving meaningful insights from large datasets. Furthermore, the advent of SQL-based technologies tailored for big data environments, such as Apache Hive and Spark SQL, has significantly enhanced SQL's applicability. These technologies allow for SQL queries to be executed over distributed data storage frameworks like Hadoop and Spark, thereby bridging the gap between traditional relational database management and the scalability demands of big data. Through these extensions, SQL empowers analysts to harness the full potential of big data, enabling scalable data analysis, trend identification, and decision support in near real-time.
Data Science: Within Data Science, SQL's role extends beyond data retrieval to include data preparation, exploration, and feature engineering, which are critical steps in the data science workflow. Given the data-driven nature of data science, the ability to efficiently query and manipulate data is paramount. SQL allows data scientists to filter, aggregate, and transform data, facilitating the creation of datasets that are ready for analysis and modeling. Moreover, SQL's interaction with data visualization tools and statistical software enables a seamless transition from data processing to analysis and model building, thereby streamlining the data science pipeline. Additionally, understanding SQL is fundamental for data scientists working with relational databases, as it allows them to directly communicate with databases, retrieve relevant subsets of data, and perform ad-hoc queries, which are often necessary for iterative data exploration and hypothesis testing.

In conclusion, SQL's integral role in Big Data Analytics and Data Science is characterized by its ability to manage and manipulate large datasets efficiently, its adaptability to distributed data frameworks, and its facilitation of critical data processing tasks. As such, SQL remains an indispensable tool in the data practitioner's toolkit, bridging the gap between raw data and actionable insights.

Course Goals

In this course, we will cover:

What SQL is used for
How to get the information you need from the database
Why SQL is used in the manner that it is
How to get started working with the databases that you need to work with

[1] CRUD Operations: CRUD is the acronym for CREATE, READ, UPDATE and DELETE. These terms describe the four essential operations for creating and managing persistent data elements, mainly in relational and NoSQL databases.

[2] Alpha: Alpha programming language: This was the original database language proposed by Edgar F. Codd, the inventor of the relational database approach. Defined in 1971, it influenced the design of QUEL but was eventually supplanted by SQL.