Data x Python

Agenda

  • Introduction
  • Data Science with Python
    • numpy, pandas, matplotlib, and scikit-learn
  • Deep Learning with Python
  • GenAI and LLMs
  • Recommendations
    • Books, online courses, YouTube channels, online communities

DA/ML/AI with Python

  • Data manipulation with numpy and pandas
  • Visualization with matplotlib
  • Machine Learning with scikit-learn
  • Deep Learning with PyTorch, TensorFlow, JAX
  • New opportunities with Large Language Models (LLMs) in Python for NLP and AI-driven tasks

NumPy

  • numpy is a powerful library for numerical computing in Python.
    • It provides support for large, multi-dimensional arrays and matrices, as well as a large collection of mathematical functions to operate on these arrays.
  • Example:
    import numpy as np
    
    a = np.array([1, 2, 3, 4, 5]) # Create a 1-dimensional array
    print(a)
    
    b = np.array([[1, 2, 3], [4, 5, 6]]) # Create a 2-dimensional array
    print(b)
    
    mean_a = np.mean(a) # Calculate the mean of the array a
    print(mean_a)
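
  • The mathematical functions mentioned above work element-wise, so no explicit Python loops are needed. A minimal sketch of this, reusing the arrays a and b from the example; the other variable names are illustrative assumptions, not from the original slides:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([[1, 2, 3], [4, 5, 6]])

squared = a ** 2                         # element-wise power, no Python loop
col_sums = b.sum(axis=0)                 # sum down each column of the 2-D array
scaled = b * np.array([10, 100, 1000])   # broadcasting: the 1-D array is applied to every row

print(squared)
print(col_sums)
print(scaled)
print(b.shape, b.T.shape)                # transposing swaps the two dimensions
```

  • Broadcasting (the scaled line) is what lets arrays of different shapes combine without copying data by hand.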
    

Pandas

  • pandas is a library for data manipulation and analysis.
    • It provides support for a variety of data structures, including Series (1-dimensional arrays) and DataFrames (2-dimensional tables).
    • Similar to data.frame in R
  • Example:
    import pandas as pd
    
    data = {
      'name': ['Alice', 'Bob', 'Charlie', 'David'],
      'age': [25, 32, 18, 47],
      'city': ['New York', 'Paris', 'London', 'Tokyo']
      }
    
    df = pd.DataFrame(data)
    print(df)
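
  • The Series structure mentioned above can be sketched in the same way. This is a small illustration under assumed names (ages, age_column), not code from the original slides:

```python
import pandas as pd

# A Series is a 1-dimensional labeled array; here the labels are names
ages = pd.Series([25, 32, 18, 47],
                 index=['Alice', 'Bob', 'Charlie', 'David'],
                 name='age')
print(ages['Bob'])     # look up a value by its label
print(ages.mean())     # Series support NumPy-style aggregations

# Selecting a single column of a DataFrame returns a Series
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 32]})
age_column = df['age']
print(type(age_column).__name__)
```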
    

Pandas (cont.)

  • pandas also provides a large collection of functions for manipulating and analyzing data, including functions for filtering, grouping, merging data, and more.
    • Similar to dplyr in R.
  • An example of how to filter a DataFrame based on a condition:
    import pandas as pd
    
    data = {
        'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 32, 18, 47],
        'city': ['New York', 'Paris', 'London', 'Tokyo']
        }
    df = pd.DataFrame(data)
    
    # Filter the DataFrame to include only people older than 30
    df_filtered = df[df['age'] > 30]
    print(df_filtered)
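
  • Grouping and merging, also mentioned above, might look like the following sketch. The countries table is a hypothetical lookup invented for this example, and David's city is changed to Paris so that one group has more than one member:

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 32, 18, 47],
    'city': ['New York', 'Paris', 'London', 'Paris'],
})

# Hypothetical lookup table, not part of the original slides
countries = pd.DataFrame({
    'city': ['New York', 'Paris', 'London'],
    'country': ['USA', 'France', 'UK'],
})

# Grouping: mean age per city (comparable to group_by + summarise in dplyr)
mean_age_by_city = people.groupby('city')['age'].mean()
print(mean_age_by_city)

# Merging: attach a country to each person (comparable to left_join in dplyr)
merged = people.merge(countries, on='city', how='left')
print(merged)
```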
    

Pandas vs Polars

  • Pandas
    • Built on NumPy arrays
    • Often compared with other tools: R, SQL, Excel
    • pandas 2.0 added optional Apache Arrow-backed data types, which can improve performance and memory use
  • Polars
    • Written in Rust.
    • Often faster than Pandas on DataFrame workloads, especially large ones.
    • Leverages multithreading for parallel operations.

Matplotlib

  • matplotlib is a library for creating visualizations in Python.
    • It provides support for a wide variety of plot types, including line plots, scatter plots, histograms, etc.
    • Similar to ggplot2 and plotly in R
  • Example:
    import matplotlib.pyplot as plt
    import numpy as np
    
    x = np.linspace(0, 10, 100)
    y = np.sin(x)
    
    plt.plot(x, y) # create a plot of the sine function over the interval [0, 10]
    plt.show()
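
  • Two of the other plot types listed above, scatter plots and histograms, can be sketched side by side. The data is randomly generated for illustration and the output file name is a hypothetical choice:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the figure is reproducible
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

# One figure with two panels side by side
fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

ax_scatter.scatter(x, y, s=10)
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")
ax_scatter.set_title("Scatter plot")

ax_hist.hist(x, bins=20)
ax_hist.set_xlabel("x")
ax_hist.set_ylabel("count")
ax_hist.set_title("Histogram")

fig.tight_layout()
fig.savefig("example_plots.png")   # hypothetical file name; call plt.show() instead in an interactive session
```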
    

Matplotlib vs Seaborn vs Plotly vs Bokeh

  • Matplotlib
    • The foundational Python library for creating static and animated plots.
    • Offers extensive functionality for various plot types.
  • Seaborn
    • Built on top of Matplotlib.
    • Provides visually appealing plots with concise syntax.
  • Plotly
    • Powerful library for creating interactive, web-based visualizations.
    • Supports various chart types, animations and 3D plots.
  • Bokeh
    • Another library for creating customizable web-based visualizations.
    • Ideal for creating interactive dashboards.

Scikit-learn

  • scikit-learn is a library for machine learning in Python.
  • It supports many ML algorithms, including
    • Regression
    • Classification
    • Clustering
    • ... and more
  • Scikit-learn also provides a large collection of functions for
    • Evaluating the performance of your ML models
    • Preprocessing data
    • Selecting features
    • ... and more.

scikit-learn Example

import numpy as np
from sklearn.linear_model import LinearRegression

# Generate sample data with 2 features: y = 2*x0 - 3*x1 + 4 plus Gaussian noise
X_train = np.random.rand(50, 2)
y_train = 2 * X_train[:, 0] - 3 * X_train[:, 1] + 4 + 0.1 * np.random.randn(50)

# Create a LinearRegression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Print the coefficients
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

# Use the model to make predictions on a set of test data
X_test = [[0.5, 0.5], [0, 0], [1, 1]]
y_pred = model.predict(X_test)
print(y_pred)
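
The preprocessing and evaluation utilities listed on the previous slide can be sketched with a small classification task. This is an illustrative example rather than part of the original deck: it assumes the Iris dataset bundled with scikit-learn and a logistic regression classifier, both chosen here only for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small dataset that ships with scikit-learn (no download needed)
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows so the model is evaluated on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing (standardize features) and the classifier in one pipeline
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Evaluation: fraction of test samples classified correctly
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking information from the test data.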

Deep Learning Frameworks

  • PyTorch
    • Easy to use, feels like standard Python, ideal for research and prototyping.
    • Best for small to medium research projects and flexible, iterative development.
  • TensorFlow with TensorFlow Lite
    • Widely used and mature (first released in 2015).
    • TensorFlow Lite is recommended for running models on low-compute devices such as phones and embedded hardware.
  • JAX
    • Growing rapidly, known for performance and memory efficiency.
    • Recommended for scientific computing and large-scale distributed training.

GenAI

Using Large Language Models (LLMs) with Python

  • LLMs like GPT, BERT, and Llama have revolutionized NLP and AI.
  • Working with LLMs:
    • Hugging Face Transformers: Access pretrained models for tasks like text generation, summarization, and question-answering.
    • OpenAI API: Use models like GPT-4 for chatbots, content generation, code suggestions, and more.
    • Ollama: An open-source tool for running language models locally, which keeps data on your own machine and offers privacy and control.

Example: Using OpenAI API

import os
from dotenv import load_dotenv
from openai import OpenAI

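load_dotenv()  # read OPENAI_API_KEY from a local .env file

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Placeholder prompts for illustration; replace with your own
system_prompt = "You are a helpful assistant."
user_prompt = "Explain what a pandas DataFrame is in one sentence."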

completion = client.chat.completions.create(
    model="gpt-4.1", # or other model
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
)

print(completion.choices[0].message.content)

NOW, PREPARE TO GET YOUR HANDS DIRTY!

Let's read and run some code

  • Install Anaconda, or Miniconda if you are an experienced user
    • Anaconda offers a ready-to-use Python setup with pre-installed libraries for data science and machine learning.
  • Get the code from "Python for Data Analysis, 3rd Edition" by Wes McKinney

Recommendations

Recommendations (cont.)
