From R to Python

A few years ago, I was asked to help on an R codebase. The performance was really slow and the program was really hard to maintain, so I decided to rewrite it in Python.

Context

A project at one of my previous companies was written in R, more as a “sandbox” or proof of concept. It already had thousands of lines of code, with a pattern where lots of globals were used and functions with the same name were overridden across the code base. Something like:

file1.R
global1 <- c(...) # Big array here
global2 <- c(...) # Another big array here

prepare <- function(a) {
  # Prepare stuff here
}

compute <- function(a, b, c) {
  global2 <<- prepare(a) # assign to the global, not to a local copy
  # Compute stuff here
}
file2.R
source(file="file1.R")

global2 <- c(...) # Big array overwritten here
global3 <- c(...) # Another big array here

compute <- function(a, b, c, d) {
  global3 <<- prepare(a)
  # Compute stuff here
}
file3.R
source(file="file2.R")

prepare2 <- function(a, ...) {
  global3 <<- prepare(a)
  # Compute stuff here
}

compute2 <- function(a, b, ...) {
  global3 <<- prepare2(a)
  # More stuff here
}

The program was doing lots of numerical computations for time series forecasting.
R passes arguments with copy-on-modify semantics: as soon as a function modifies one of its parameters, R copies it, which effectively makes arguments immutable. That’s fine for small parameters, but for big arrays the copies cost a lot of memory and slow the program down. That’s why globals were used: to avoid copying big arrays around.
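Python, by contrast, passes references to objects, so a big NumPy array can be handed to a function and modified in place with no copy at all. A minimal sketch of the difference (the array size is made up for illustration):

import numpy as np

def prepare(a):
    # a refers to the caller's array: modifying it in place
    # makes no copy, unlike R's copy-on-modify
    a *= 2.0
    return a

big = np.zeros(100_000_000)  # ~800 MB, allocated once
prepare(big)                 # no second 800 MB allocation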

Moreover, some parts of the code base were written in C to further increase performance. But the program still took hours to run and compute forecasts, while consuming GBs of memory.

Needless to say, the code was very hard to maintain and modify:

  • lots of globals and function overrides (see above)
  • heavy use of multiprocessing through the parallel library
  • no tests
  • no documentation
  • new features needed to be added
  • lots of bugs needed to be fixed (and lots of performance problems)
  • no CI (a very minor point in comparison)

And the biggest problem of all: I knew nothing about R.

The rewrite

After a few weeks of learning R, trying to improve things, and making the code more maintainable, I saw no end to it. While I was refactoring, other data scientists kept adding new features and copying more functions. At first I tried to make sense of it with them and to organize things a bit more, but failed: they knew their code, and it was fine for them.
Without telling anyone, I decided to rewrite the whole thing in Python in parallel.

Steps

My goal was not to drop all development on the R program and work only on the rewrite. It was to do both in parallel, allocating a few hours every day (maybe 2 to 3 max) to the rewrite, while working on the original program the rest of the day.

  • to lower the scope of the rewrite, I decided to keep all C functions as is and integrate them through Cython
  • multiprocessing with the parallel library was replicated with joblib (see the sketch after this list)
  • most of the numerical logic was replaced by numpy
  • keep exactly the same inputs as the R program
  • to keep things simple, I mimicked every step of the R program and put most code in files with the same names (e.g. file1.R became file1.py), using the same function names, so that members of the team could navigate the new code base
  • also generate the same output for comparison: as there were no tests, my only way of checking the results was to run both programs and compare the differences
  • as I needed a baseline, I also had to reproduce the bugs I found along the way, or fix them in the R code base and do the same in the new Python code
  • along the way, add Python unit tests to ensure non-regression on the new code base
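As an illustration of the joblib point above: R code built around parallel::mclapply maps almost one-to-one onto joblib. A minimal sketch, where compute and all_series are hypothetical stand-ins for the real forecasting function and data:

import numpy as np
from joblib import Parallel, delayed

def compute(series):
    # Hypothetical per-series forecasting step
    return series.cumsum()

all_series = [np.random.rand(1000) for _ in range(100)]

# R equivalent: results <- parallel::mclapply(all_series, compute)
results = Parallel(n_jobs=-1)(delayed(compute)(s) for s in all_series)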

After a few weeks, I had a functional program covering most of the R code base. The rewrite went much faster than I had initially thought, because lots of functions in the R code base were actually unused (due to file inclusion and overriding) or could easily be merged in Python.

I then showed it to the team. Most of them were enthusiastic about the performance gains and the more robust code base, and even though they were not as fluent in Python as in R, they were OK with switching once the new version was validated. Others were much less happy: while acknowledging the performance gains, they felt the time would have been better spent improving the original code base. They had a point, but I think that in the long run, switching to Python was the best choice.

Validation

Once it was agreed to continue work on the Python version, it was time to validate it and get rid of bugs:

  • demonstrate the new code base to the team, show them the tests, gather their feedback and improve the new version
  • try to convince the other team members of the validity of the new version (I failed here)
  • set up a CI first to have both R and Python output their results
  • use the R code base as a big integration test: the same input must give exactly the same output
  • modify the Python code to fix bugs, and / or modify both programs when the error came from the R program
  • keep a list of all inputs / outputs and rerun them every time, to ensure non-regression between the same “version” of the programs (see the sketch below). This was a bit challenging because there was no versioning at first: new code was just committed, and the commit hashes were used as versions
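The rerun itself was nothing fancy; the idea boils down to a small check over each recorded run. A minimal sketch, with hypothetical paths and file layout:

import numpy as np

# Each recorded run pairs one input with the outputs produced
# by the R program and by the Python program (hypothetical layout)
r_out = np.loadtxt("runs/run42/output_r.csv", delimiter=",")
py_out = np.loadtxt("runs/run42/output_py.csv", delimiter=",")

# At this stage the goal was exact equality; the delta-based
# comparison described below came later
assert np.array_equal(r_out, py_out), "R and Python outputs diverged"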

At first, it was OK to do ugly things to make the port go faster and stay closer to the R program, like using globals to mimic the function behaviors, …
One of the big struggles I had was numerical precision: the two systems output different, but very close, values. Lots of computations compared floats or doubles, and I had to modify both programs a bit to compare with a delta to ensure stability.
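With NumPy, that kind of delta comparison is a one-liner: np.allclose checks equality within a relative and an absolute tolerance. A minimal sketch (the values are made up):

import numpy as np

r_out = np.array([1.0000000001, 2.5, 3.7])
py_out = np.array([1.0, 2.5000000002, 3.7])

# Strict equality fails because of the last few bits...
assert not np.array_equal(r_out, py_out)

# ...but a comparison with a delta passes, and is stable
# across the two runtimes
assert np.allclose(r_out, py_out, rtol=1e-9, atol=1e-12)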

Once I was confident enough (meaning both programs had run for a few days without any difference in their outputs), it was time to:

  • switch the programs in production from R to Python
  • keep both programs running in parallel for a few days, still comparing their outputs to detect potential regressions (there were none!)
  • ditch the R program, be happy
  • now only maintain the new Python program

Conclusion

  • because I decided to do it alone (my mistake), a few people were unhappy: they lost ownership of the R program (that’s important!), they didn’t care much about the Python program and were reluctant to work on it.
    That’s the biggest lesson learned: to do something like this, I needed to communicate more with the other developers.
  • orders of magnitude faster: the program ran in minutes instead of hours, with a much lower memory footprint. This was mainly due to Python passing values by reference and to a better use of multiprocessing with improved memory management
  • once in production, the Python program could be improved further with Cython to make it even faster. As Cython is more tightly integrated with Python than C is with R, development time was also shorter
  • afterwards, more fixes to the code came in, mostly due to R’s legacy being removed