Data x Python

Agenda

  • Introduction
  • Data Science with Python
    • numpy, pandas, matplotlib, and scikit-learn
  • Deep Learning with Python
  • LLMs and AI with Python
    • LLM APIs, prompt engineering, and AI agents
  • Recommendations
    • Books, online courses, YouTube channels, online communities, AI tools

DA/ML/AI with Python

  • Data manipulation with numpy, pandas, and polars
  • Visualization with matplotlib, seaborn, and plotly
  • Machine Learning with scikit-learn
  • Deep Learning with PyTorch, TensorFlow, JAX
  • GenAI and LLMs for building AI-powered applications, agents, and RAG systems

Numpy

  • numpy is a powerful library for numerical computing in Python.
    • It provides support for large, multi-dimensional arrays and matrices, as well as a large collection of mathematical functions to operate on these arrays.
  • Example:
    import numpy as np
    
    a = np.array([1, 2, 3, 4, 5]) # Create a 1-dimensional array
    print(a)
    
    b = np.array([[1, 2, 3], [4, 5, 6]]) # Create a 2-dimensional array
    print(b)
    
    mean_a = np.mean(a) # Calculate the mean of the array a
    print(mean_a)
    

Pandas

  • pandas is a library for data manipulation and analysis.
    • It provides support for a variety of data structures, including Series (1-dimensional arrays) and DataFrames (2-dimensional tables).
    • Similar to data.frame in R
  • Example:
    import pandas as pd
    
    data = {
      'name': ['Alice', 'Bob', 'Charlie', 'David'],
      'age': [25, 32, 18, 47],
      'city': ['New York', 'Paris', 'London', 'Tokyo']
      }
    
    df = pd.DataFrame(data)
    print(df)
    

Pandas

  • pandas also provides a large collection of functions for manipulating and analyzing data, including functions for filtering, grouping, merging data, and more.
    • Similar to dplyr in R.
  • An example of how to filter a DataFrame based on a condition:
    import pandas as pd
    
    data = {
        'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 32, 18, 47],
        'city': ['New York', 'Paris', 'London', 'Tokyo']
        }
    df = pd.DataFrame(data)
    
    # Filter the DataFrame to include only people older than 30
    df_filtered = df[df['age'] > 30]
    print(df_filtered)
    

Pandas vs Polars

  • Pandas
    • Industry standard, supports Numpy arrays
    • Comparison with other tool: R, SQL, Excel
    • Pandas 3 (Jan 2026): Apache Arrow by default, faster string ops
  • Polars
    • Written in Rust, widely adopted for performance-critical workflows
    • 5-10x faster than Pandas for large datasets
    • Lazy evaluation and query optimization
    • Leverages multithreading for parallel operations
    • Similar API to Pandas but with modern improvements

Matplotlib

  • matplotlib is a library for creating visualizations in Python.
    • It provides support for a wide variety of plot types, including line plots, scatter plots, histograms, etc.
    • Similar to ggplot2 and plotly in R
  • Example:
    import matplotlib.pyplot as plt
    import numpy as np
    
    x = np.linspace(0, 10, 100)
    y = np.sin(x)
    
    plt.plot(x, y) # create a plot of the sine function over the interval [0, 10]
    plt.show()
    

Matplotlib vs Seaborn vs Plotly vs Bokeh

  • Matplotlib
    • Basic Python library for creating static and animated plots.
    • Offers extensive functionality for various plot types.
  • Seaborn
    • Built on top of Matplotlib.
    • Provides visually appealing plots with concise syntax.
  • Plotly
    • Powerful library for creating interactive, web-based visualizations.
    • Supports various chart types, animations and 3D plots.
  • Bokeh
    • Another library for creating customizable web-based visualizations.
    • Ideal for creating interactive dashboards.

Scikit-learn

  • scikit-learn is a library for machine learning in Python
  • Scikit-learn provides support for many ML algorithms, including
    • Regression
    • Classification
    • Clustering
    • ... and more
  • Scikit-learn also provides a large collection of functions for
    • Evaluating the performance of your ML models
    • Preprocessing data
    • Selecting features
    • ... and more.

scikit-learn Example

import numpy as np
from sklearn.linear_model import LinearRegression

# Generate some 2D sample data
X_train = np.random.rand(50, 2) 
y_train = 2 * X_train[:, 0] - 3 * X_train[:, 1] + 4 + 0.1 * np.random.randn(50)

# Create a LinearRegression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Print the coefficients
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

# Use the model to make predictions on a set of test data
X_test = [[0.5, 0.5], [0, 0], [1, 1]]
y_pred = model.predict(X_test)
print(y_pred)

Deep Learning Frameworks

  • PyTorch (dominant)
    • The default framework in both research and industry
    • torch.compile() for production performance
    • Easy to use, feels like standard Python, ideal for research and production
  • JAX (growing)
    • Strong in research and scientific computing (used by Google/DeepMind)
    • Excellent for performance, memory efficiency, and large-scale training
    • Functional programming approach with automatic differentiation
  • TensorFlow (still used in production)
    • Less dominant in research, but common in deployed systems
    • Keras 3 is now multi-backend (PyTorch, JAX, TF)

Using Large Language Models (LLM) with Python

Example: Using OpenAI API

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

client = OpenAI()

response = client.responses.create(
    model="gpt-5.4-mini",
    input="Write a one-sentence bedtime story about a unicorn."
)

print(response.output_text)

Prompt Engineering Best Practices

  • Clear Instructions: Be specific and detailed about what you want
  • Provide Context: Give relevant background information
  • Use Examples: Few-shot learning with input-output examples
  • Chain of Thought: Ask model to think step-by-step for complex reasoning
  • System Prompts: Define role and behavior guidelines
  • Temperature Control: Lower (0.0-0.3) for consistency, higher (0.7-1.0) for creativity
  • More resources:

Building AI Agents and RAG Applications

  • AI Agents: Systems that use LLMs to make decisions and take actions
  • RAG (Retrieval-Augmented Generation): Enhance LLMs with external knowledge
    • Combine LLMs with vector databases (Pinecone, Weaviate, ChromaDB)
    • Use LlamaIndex or LangChain for RAG pipelines
  • UI Frameworks: Build interfaces quickly
    • Streamlit: Fast prototyping for data/ML apps
    • Gradio: Easy interfaces for ML models

NOW, PREPARE TO GET YOUR HANDS DIRTY!

Let's read and run some code

  • Python Environment Setup:
    • Anaconda: Full-featured distribution with 250+ packages
    • Miniconda/Miniforge: Minimal installer (recommended for experienced users)
    • uv: Ultra-fast Python package installer (recommended)
  • Interactive Notebooks:
  • Get code from "Python for Data Analysis 3rd Edition" by Wes McKinney

Recommendations

Recommendations (cont.)

global styles