From R to Python

A few years ago, I was asked to help on an R codebase. The performance was really slow and the program was really hard to maintain, so I decided to rewrite it in Python.

Context

A project at one of my previous companies was written in R, more as a “sandbox” or proof of concept. It already had thousands of lines of code, with a pattern where lots of globals were used and functions with the same name were overridden across the code base. Something like:

file1.R
global1 <- c(...) # Big array here
global2 <- c(...) # Another big array here

prepare <- function(a) {
  # Prepare stuff here
}

compute <- function(a, b, c) {
  global2 <<- prepare(a) # assign to the global, not to a local copy
  # Compute stuff here
}
file2.R
source(file="file1.R")

global2 <- c(...) # Big array overwritten here
global3 <- c(...) # Another big array here

compute <- function(a, b, c, d) {
  global3 <<- prepare(a)
  # Compute stuff here
}
file3.R
source(file="file2.R")

prepare2 <- function(a, ...) {
  global3 <<- prepare(a)
  # Compute stuff here
}

compute2 <- function(a, b, ...) {
  global3 <<- prepare2(a)
  # More stuff here
}

The program was doing lots of numerical computations for time series forecasting.
R passes arguments with copy-on-modify semantics: as soon as a function modifies one of its parameters, R copies it, which effectively makes arguments immutable. That’s fine for small parameters, but for big arrays the copies cost a lot of memory and slow the program down. That’s why globals were used: to avoid copying big arrays around.
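Python, by contrast, passes references to objects, so a big NumPy array can be handed to a function and modified in place with no copy at all. A minimal sketch of the difference (the array size is made up for illustration):

import numpy as np

def prepare(a):
    # a refers to the caller's array: modifying it in place
    # makes no copy, unlike R's copy-on-modify
    a *= 2.0
    return a

big = np.zeros(100_000_000)  # ~800 MB, allocated once
prepare(big)                 # no second 800 MB allocation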

Moreover, some parts of the code base were written in C to further increase performance. But the program still took hours to run and compute forecasts, while consuming GBs of memory.

Needless to say, the code was very hard to maintain and modify:

  • lots of globals and function overrides (see above)
  • heavy use of multiprocessing through the parallel library
  • no tests
  • no documentation
  • new features needed to be added
  • lots of bugs needed to be fixed (and lots of performance problems)
  • no CI (a very minor point in comparison)

And the biggest problem of all: I knew nothing about R.

The rewrite

After a few weeks of learning R, trying to improve things, and making the code more maintainable, I saw no end to it. While I was refactoring, other data scientists kept adding new features and copying more functions. At first I tried to make sense of it with them and to organize things a bit more, but failed: they knew their code, and it was fine for them.
Without telling anyone, I decided to rewrite the whole thing in Python in parallel.

Steps

My goal was not to drop all development on the R program and work only on the rewrite. It was to do both in parallel, allocating a few hours every day (maybe 2 to 3 max) to the rewrite, while working on the original program the rest of the day.

  • to lower the scope of the rewrite, I decided to keep all C functions as is and integrate them through Cython
  • multiprocessing with the parallel library was replicated with joblib (see the sketch after this list)
  • most of the numerical logic was replaced by numpy
  • keep exactly the same inputs as the R program
  • to keep things simple, I mimicked every step of the R program and put most code in files with the same names (e.g. file1.R became file1.py), using the same function names, so that members of the team could navigate the new code base
  • also generate the same output for comparison: as there were no tests, my only way of checking the results was to run both programs and compare the differences
  • as I needed a baseline, I also had to reproduce the bugs I found along the way, or fix them in the R code base and do the same in the new Python code
  • along the way, add Python unit tests to ensure non-regression on the new code base
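As an illustration of the joblib point above: R code built around parallel::mclapply maps almost one-to-one onto joblib. A minimal sketch, where compute and all_series are hypothetical stand-ins for the real forecasting function and data:

import numpy as np
from joblib import Parallel, delayed

def compute(series):
    # Hypothetical per-series forecasting step
    return series.cumsum()

all_series = [np.random.rand(1000) for _ in range(100)]

# R equivalent: results <- parallel::mclapply(all_series, compute)
results = Parallel(n_jobs=-1)(delayed(compute)(s) for s in all_series)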

After a few weeks, I had a functional program covering most of the R code base. The rewrite went much faster than I had initially thought, because lots of functions in the R code base were actually unused (due to file inclusion and overriding) or could easily be merged in Python.

I then showed it to the team. Most of them were enthusiastic about the performance gains and the more robust code base, and even though they were not as fluent in Python as in R, they were OK with switching once the new version was validated. Others were much less happy: while acknowledging the performance gains, they felt the time would have been better spent improving the original code base. They had a point, but I think that in the long run, switching to Python was the best choice.

Validation

Once it was agreed to continue work on the Python version, it was time to validate it and get rid of bugs:

  • demonstrate the new code base to the team, show them the tests, gather their feedback and improve the new version
  • try to convince the other team members of the validity of the new version (I failed here)
  • set up a CI first to have both R and Python output their results
  • use the R code base as a big integration test: the same input must give exactly the same output
  • modify the Python code to fix bugs, and / or modify both programs when the error came from the R program
  • keep a list of all inputs / outputs and rerun them every time, to ensure non-regression between the same “version” of the programs (see the sketch below). This was a bit challenging because there was no versioning at first: new code was just committed, and the commit hashes were used as versions
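The rerun itself was nothing fancy; the idea boils down to a small check over each recorded run. A minimal sketch, with hypothetical paths and file layout:

import numpy as np

# Each recorded run pairs one input with the outputs produced
# by the R program and by the Python program (hypothetical layout)
r_out = np.loadtxt("runs/run42/output_r.csv", delimiter=",")
py_out = np.loadtxt("runs/run42/output_py.csv", delimiter=",")

# At this stage the goal was exact equality; the delta-based
# comparison described below came later
assert np.array_equal(r_out, py_out), "R and Python outputs diverged"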

At first, it was OK to do ugly things to make the port go faster and stay closer to the R program, like using globals to mimic the function behaviors, …
One of the big struggles I had was numerical precision: the two systems output different, but very close, values. Lots of computations compared floats or doubles, and I had to modify both programs a bit to compare with a delta to ensure stability.
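With NumPy, that kind of delta comparison is a one-liner: np.allclose checks equality within a relative and an absolute tolerance. A minimal sketch (the values are made up):

import numpy as np

r_out = np.array([1.0000000001, 2.5, 3.7])
py_out = np.array([1.0, 2.5000000002, 3.7])

# Strict equality fails because of the last few bits...
assert not np.array_equal(r_out, py_out)

# ...but a comparison with a delta passes, and is stable
# across the two runtimes
assert np.allclose(r_out, py_out, rtol=1e-9, atol=1e-12)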

Once I was confident enough (meaning both programs had run for a few days without any difference in their outputs), it was time to:

  • switch the programs in production from R to Python
  • keep both programs running in parallel for a few days, still comparing their outputs to detect potential regressions (there were none!)
  • ditch the R program, be happy
  • now only maintain the new Python program

Conclusion

  • because I decided to do it alone (my mistake), a few people were unhappy: they lost ownership of the R program (that’s important!), they didn’t care much about the Python program and were reluctant to work on it.
    That’s the biggest lesson learned: to do something like this, I needed to communicate more with the other developers.
  • orders of magnitude faster: the program ran in minutes instead of hours, with a much lower memory footprint. This was mainly due to Python passing values by reference and to a better use of multiprocessing with improved memory management
  • once in production, the Python program could be improved further with Cython to make it even faster. As Cython is more tightly integrated with Python than C is with R, development time was also shorter
  • afterwards, more fixes to the code came in, mostly due to R’s legacy being removed