How to… Python Packages

Introduction

Python ships with useful built-in functionality such as print(), enumerate(), or list.append(). Projects often need more specialized tools, however, and this is where packages (also called libraries) come in. Packages extend Python with specialized functions and tools for solving a wide range of problems efficiently.

The most frequently used packages include:

  • numpy: For fast numerical and mathematical computations.
  • pandas: For the efficient processing of large amounts of structured data (tables).
  • matplotlib: For creating data visualizations and diagrams.

Covering every function of these packages in detail would go beyond the scope of this summary. Instead, it focuses on one practical example for each package.

NumPy

Transposing a matrix means swapping its rows and columns. If \(A\) is a matrix, its transpose \(A^T\) is defined by:

\[ (A^T)_{ij} = A_{ji} \]
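As a small worked instance of this definition, here is the transpose of the 2×2 matrix that also appears in the code of this section (indices are 1-based here, as is usual in mathematical notation):

\[ A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \quad\Rightarrow\quad A^T = \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix} \]

The entry \(A_{12} = 2\) moves to position \((2,1)\) and \(A_{21} = 3\) moves to position \((1,2)\), while the diagonal entries stay in place.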

In pure Python, transposing a matrix means iterating over rows and columns by hand, which is verbose and comparatively slow.

matrix = [[1, 2], [3, 4]]
rows = len(matrix)
cols = len(matrix[0])
transposed = []

for i in range(cols):
    new_row = []
    for j in range(rows):
        new_row.append(matrix[j][i])
    transposed.append(new_row)

print(transposed)
[[1, 3], [2, 4]]
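The nested loop can be shortened with the built-in zip(), which is a common pure-Python idiom rather than something the example above relies on. The sketch below uses a hypothetical 2×3 matrix to show that the approach also handles non-square input:

```python
# Hypothetical 2x3 example matrix (not taken from the text above)
matrix = [[1, 2, 3], [4, 5, 6]]

# zip(*matrix) groups the i-th elements of all rows, i.e. the columns
transposed = [list(column) for column in zip(*matrix)]

print(transposed)  # [[1, 4], [2, 5], [3, 6]]
```

This is shorter, but it still builds every new row as a Python list, so it does not remove the performance problem for large matrices.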

For large matrices, the nested loops become slow. NumPy solves the problem concisely: the .T attribute returns the transpose as a view on the same data, so no elements are copied. More generally, NumPy operations are vectorized and run in optimized compiled code rather than in Python loops.

import numpy as np

matrix = np.array([[1, 2], [3, 4]])
transposed = matrix.T

print(transposed)
[[1 3]
 [2 4]]
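To illustrate what "view" means in practice, the following sketch (using a hypothetical 2×3 array) checks the shape of the transpose and shows that it shares memory with the original array:

```python
import numpy as np

# Hypothetical 2x3 example array: [[0 1 2], [3 4 5]]
a = np.arange(6).reshape(2, 3)
t = a.T

print(t.shape)                 # (3, 2)
print(np.shares_memory(a, t))  # True: no data was copied

# Writing through the view changes the original array as well
t[0, 1] = 99
print(a[1, 0])                 # 99
```

If an independent transposed array is needed, calling a.T.copy() creates one.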

Tip

Check out the NumPy Cheat Sheet.

Pandas

Working with structured data in native Python is possible, but quickly becomes complicated and error-prone with larger data sets. Even simply summarizing a table already requires a considerable amount of code and manual data processing.

sum_col_a = 0.0
sum_col_b = 0.0
count = 0

with open("data/pandasTable.csv", "r") as f:
    header = f.readline()  # skip the header row

    for line in f:
        values = line.strip().split(',')
        sum_col_a += float(values[1])  # second column of the file
        sum_col_b += float(values[2])  # third column of the file
        count += 1

avg_col_a = 0.0
avg_col_b = 0.0

if count > 0:
    avg_col_a = sum_col_a / count
    avg_col_b = sum_col_b / count

print("Number of Records:", count)
print("Column ColA:")
print("  Sum:", sum_col_a)
print("  Average:", avg_col_a)
print("Column ColB:")
print("  Sum:", sum_col_b)
print("  Average:", avg_col_b)
Number of Records: 5
Column ColA:
  Sum: 82.2
  Average: 16.44
Column ColB:
  Sum: 805.0
  Average: 161.0

With Pandas, we can read structured data with one function call and retrieve summary statistics with a second, which simplifies data analysis considerably.

import pandas as pd

df = pd.read_csv("data/pandasTable.csv")
df.describe()
             ID      hight       width    Group
count  5.000000   5.000000    5.000000  5.00000
mean   3.000000  16.440000  161.000000  1.80000
std    1.581139  10.582911  108.880669  0.83666
min    1.000000   5.900000   50.000000  1.00000
25%    2.000000  10.500000  100.000000  1.00000
50%    3.000000  12.700000  125.000000  2.00000
75%    4.000000  20.100000  200.000000  2.00000
max    5.000000  33.000000  330.000000  3.00000
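To reproduce only the sums and averages from the manual version, a selection of columns can be aggregated directly. The sketch below is self-contained and assumes made-up rows: the contents of data/pandasTable.csv are not shown here, so the values were chosen to be consistent with the summary statistics above, and the column names follow the describe() output.

```python
import io

import pandas as pd

# Hypothetical file contents, chosen to match the statistics shown above
csv_text = """ID,hight,width,Group
1,5.9,50,1
2,10.5,100,1
3,12.7,125,2
4,20.1,200,2
5,33.0,330,3
"""

# read_csv accepts file-like objects, so no file on disk is needed here
df = pd.read_csv(io.StringIO(csv_text))

# Sum and mean for the two measurement columns only
summary = df[["hight", "width"]].agg(["sum", "mean"])
print(summary)
```

The result is itself a DataFrame, so single values can be read with, for example, summary.loc["mean", "width"].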

Tip

Check out the Pandas Cheat Sheet.

Matplotlib

Visualizing data in pure Python is difficult: the language has no built-in tools for drawing diagrams, so without specialized packages we are limited to text output.

data = {"A": 5, "B": 10, "C": 3}
max_val = max(data.values())
scale = 50 / max_val  # scale to a maximum width of 50 characters

for key, value in data.items():
    bar_length = int(value * scale)
    print(f"{key}: {'#' * bar_length} {value}")
A: ######################### 5
B: ################################################## 10
C: ############### 3

Fortunately, Matplotlib is a versatile library for creating many types of diagrams.

import matplotlib.pyplot as plt

data = {"A": 5, "B": 10, "C": 3}

keys = list(data.keys())
values = list(data.values())

plt.figure()
plt.bar(keys, values, color=['blue', 'red', 'green'])

plt.title('Simple Data Visualization')
plt.xlabel('Category')
plt.ylabel('Value')

plt.show()
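When no display is available (for example in a script running on a server), the same chart can be written to an image file instead of being shown. The following is a minimal sketch using Matplotlib's object-oriented interface; the non-interactive "Agg" backend and the file name bar_chart.png are assumptions made for this illustration.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen, no window needed

import matplotlib.pyplot as plt

data = {"A": 5, "B": 10, "C": 3}

# Create a figure and a single set of axes to draw on
fig, ax = plt.subplots()
bars = ax.bar(list(data.keys()), list(data.values()),
              color=["blue", "red", "green"])

ax.set_title("Simple Data Visualization")
ax.set_xlabel("Category")
ax.set_ylabel("Value")

# Hypothetical output file name
fig.savefig("bar_chart.png")
plt.close(fig)
```

Working with fig and ax explicitly becomes more convenient than the plt.* functions once a script creates several figures or subplots.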

Tip

Check out the Matplotlib Cheat Sheet.

More to explore

Thousands of packages are available for Python. It is advisable, however, to prefer well-maintained, established packages such as numpy, pandas, and matplotlib, as these are updated regularly and backed by large communities.

If you are looking for a specific tool, PyPI (the Python Package Index) is the central hub: https://pypi.org
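Packages from PyPI are typically installed with pip (for example, pip install numpy). Before running the examples in this summary, it can be useful to check which packages are already available. The sketch below uses only the standard library; the helper name package_status is hypothetical and was chosen for this illustration.

```python
from importlib import metadata
from importlib.util import find_spec


def package_status(name):
    """Return the installed version of a top-level package, or a short note."""
    if find_spec(name) is None:
        return "not installed"
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        # Importable, but without distribution metadata (e.g. stdlib modules)
        return "installed (version unknown)"


for name in ("numpy", "pandas", "matplotlib"):
    print(f"{name}: {package_status(name)}")
```

The printed versions depend on the local environment, so no fixed output is shown here.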