Thoughts and ideas: Comparing R and Python

I have used R for data analysis for quite some time. With the Tidyverse packages in particular, it has been a very decent experience, and the ggplot2 package for plotting is mostly intuitive. The synergy of the Tidyverse ecosystem with the bioinformatics and statistical analysis software available on the R platform makes for an awesome combination.

Recently, I have been wanting to try out Python for my daily microbiome data analysis. Julia was another option, but for some reason it still feels incomplete. There have been decent attempts to replicate the tidyverse in Julia, such as Tidier.jl (https://github.com/TidierOrg/Tidier.jl). However, it still feels like a work in progress.

There have been times when trying to code more defensively in R has led to very cumbersome code, for example when applying tryCatch statements. The example below, which is similar to my use case, was generated using ChatGPT 4:

-----------------------------------------------------------------------------------------------------------------------------

# Sample data: Matrices of numbers for which the logarithm will be calculated
matrix1 <- matrix(c(10, -1, 20, -5), nrow = 2)
matrix2 <- matrix(c(30, -2, 40, 25), nrow = 2)

# List of matrices
matrices <- list(matrix1, matrix2)

# Outer loop iterates over the list of matrices
for (i in seq_along(matrices)) {
  # Inner loop iterates over rows of each matrix
  for (j in seq_len(nrow(matrices[[i]]))) {
    # Use tryCatch to handle potential issues at the row level
    tryCatch({
      # Second inner loop iterates over columns of each row
      for (k in seq_len(ncol(matrices[[i]]))) {
        # Actual operation inside the deepest loop
        tryCatch({
          log_result <- log(matrices[[i]][j, k])
          print(paste("Logarithm of element [", j, ",", k, "] in matrix", i, "is", log_result))
        }, warning = function(w) {
          # log() of a negative number gives NaN with a warning, not an error
          print(paste("Warning for element [", j, ",", k, "] in matrix", i, ":", w$message))
        }, error = function(e) {
          print(paste("Error for element [", j, ",", k, "] in matrix", i, ":", e$message))
        })
      }
    }, error = function(e) {
      print(paste("Error while processing row", j, "in matrix", i, ":", e$message))
    })
  }
}

-----------------------------------------------------------------------------------------------------------------------------

This was really cumbersome in R. I agree that it is not the most idiomatic way to code in the R universe, so I asked ChatGPT to do the same steps using lapply or the apply family of functions.

-----------------------------------------------------------------------------------------------------------------------------

# Define matrices
matrix1 <- matrix(c(10, -1, 20, -5), nrow = 2)
matrix2 <- matrix(c(30, -2, 40, 25), nrow = 2)

# Create a list of matrices
matrices <- list(matrix1, matrix2)

# Function to process each element of a row
process_element <- function(elem) {
  tryCatch({
    log_result <- log(elem)
    return(paste("Log result:", log_result))
  }, warning = function(w) {
    return(paste("Warning:", w$message))
  }, error = function(e) {
    return(paste("Error:", e$message))
  })
}

# Function to process each row
process_row <- function(row) {
  sapply(row, process_element)
}

# Function to process each matrix
process_matrix <- function(matrix) {
  apply(matrix, 1, process_row) # Using apply to iterate over rows
}

# Use lapply to process each matrix in the list
results <- lapply(matrices, process_matrix)

# Print the results
print(results)

-----------------------------------------------------------------------------------------------------------------------------

It is more readable and functional, but I think the for loop in Python makes it more approachable. Below is an example of the same operations in Python:

-----------------------------------------------------------------------------------------------------------------------------

import numpy as np

# Sample data: Numpy arrays (similar to matrices in R)
array1 = np.array([[10, -1], [20, -5]])
array2 = np.array([[30, -2], [40, 25]])

# List of arrays (similar to list of matrices in R)
arrays = [array1, array2]

# By default np.log returns nan with a RuntimeWarning for negative input.
# Ask numpy to raise FloatingPointError instead so it can be caught below.
np.seterr(invalid="raise", divide="raise")

# Outer loop iterates over the list of arrays
for i, arr in enumerate(arrays):
    # Inner loop iterates over rows of each array
    for j in range(arr.shape[0]):
        try:
            # Second inner loop iterates over columns of each row
            for k in range(arr.shape[1]):
                try:
                    # Attempt the operation (e.g., logarithm) on the element
                    log_result = np.log(arr[j, k])
                    print(f"Logarithm of element [{j}, {k}] in array {i+1} is {log_result}")
                except FloatingPointError as e:
                    # Handle errors for individual operations
                    print(f"Error for element [{j}, {k}] in array {i+1}: {e}")
        except Exception as e:
            # Handle errors that might occur while processing the row
            print(f"Error while processing row {j} in array {i+1}: {e}")

-----------------------------------------------------------------------------------------------------------------------------
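For comparison with the apply-based R version, the same per-element handling can also be sketched in Python as small functions plus list comprehensions. This is only a rough sketch and not part of my actual pipeline: the names safe_log and process_matrix are made up for illustration, and I use math.log from the standard library here because it raises ValueError for non-positive input, whereas np.log by default only warns and returns nan.

```python
import math


def safe_log(value):
    """Return (result, error_message); result is None when the log is undefined.

    Hypothetical helper: math.log raises ValueError for zero or negative input.
    """
    try:
        return math.log(value), None
    except ValueError as exc:
        return None, str(exc)


def process_matrix(rows):
    """Apply safe_log to every element, keeping the row/column layout."""
    return [[safe_log(value) for value in row] for row in rows]


# Same numbers as the numpy arrays above, written as nested lists
matrices = [
    [[10, -1], [20, -5]],
    [[30, -2], [40, 25]],
]

# One result grid per matrix, similar in spirit to lapply in R
results = [process_matrix(m) for m in matrices]

for i, grid in enumerate(results, start=1):
    for j, row in enumerate(grid):
        for k, (log_result, error) in enumerate(row):
            if error is None:
                print(f"Logarithm of element [{j}, {k}] in matrix {i} is {log_result}")
            else:
                print(f"Error for element [{j}, {k}] in matrix {i}: {error}")
```

The error handling lives in one place (safe_log), so the loops that print the results no longer need any try blocks.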

Thinking about this and trying to replicate my analysis "pipeline" in Python has made me think about things that I take for granted when using R. R is more straightforward for statistical analysis when it comes to knowing which interpreter I need to use. Within VSCode, R has a nice data viewer that Python badly lacks. When I say Python, I mean Python without Jupyter. I despise notebooks since they are not sustainable for long-form analysis.

I will keep updating....


Tuesday, April 9, 2024
