Memory management explained in Ruby

Published by

Mathieu EUSTACHY

11 min read

In the last article, we talked about thread management in Ruby and its ecosystem. Today we will look at what memory management is and how Ruby does it.


Provided you finish this article (you really are a nerd then), you will be able to get down to the level of abstraction of the Ruby virtual machine (Ruby VM), to define what the Ruby heap and the barbaric terms "Ruby heap page" and "malloc" are, and to understand the fundamental role of the garbage collector in the Ruby VM.


At this point, knowing these concepts, you could totally flex as a Ruby developer.


This article is a big and long one, as you can see, but I think that by reading it once or twice you will get stars in your eyes, so hang on tight!


We will first delve into memory management in operating systems, then we will showcase how Ruby does it.


Let’s dive into it!


This article is the 5th article of a broader series about “low-level” computing concepts applied to Ruby.

  1. What is a Ruby implementation?
  2. Process management in Ruby
  3. Concurrency and parallelism in Ruby
  4. Thread management in Ruby
  5. Memory management in Ruby


Always keep in mind that I am intentionally summarising things to give you a quick overview. There is more to each concept.



Memory management in operating systems


I found that, most of the time, articles about memory management in Ruby do not properly explain memory management itself first. I think it’s a mistake: it is difficult to grasp how Ruby manages memory without understanding this concept, so we will start with that explanation.


In operating systems, memory management is the function responsible for managing the computer's primary memory, also called the RAM (I bet you have already heard this name somewhere).


RAM stands for "Random Access Memory". It is the memory used by the computer to temporarily store data that is actively being used or processed by the CPU. The more RAM your computer has, the better it is at running heavy programs (like video games or web applications, for example), since they require more memory to work efficiently.


Understanding the basics of RAM is mandatory to understand the rest of the article. Let’s have a quick look at it first.



RAM


The RAM is called “random-access” because of the way it accesses data: it can read and change data in any order. As a result, any piece of data stored in the RAM can be read or written in almost the same amount of time, wherever it is located.


This is really important since, in theory, a program uses a lot of things stored in the RAM at the same time, so everything should be accessible as quickly as possible. The cost of accessing data stored in the RAM is the same for every piece of data.


To summarise, the RAM is a very efficient memory for things that are live. Now that you understand why the RAM is important, I want you to be able to picture it in your mind. I believe it helps a lot to understand it even better.


You could think of the RAM as a large array of pages, which are themselves arrays of bytes. It could be illustrated by the following code:


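# Reminder: Array.new(size, value) fills the array with the same value,
# while Array.new(size) { ... } runs the block once per element.

page_size_in_bytes = 4096 # 4 kilobytes
number_of_pages = 1024

# The block form gives every page its own array of bytes
# (Array.new(number_of_pages, page) would reuse one single page everywhere).
ram = Array.new(number_of_pages) { Array.new(page_size_in_bytes, 0) }

# This would result in a 4 megabytes RAM looking like:
# => [
# [0, 0, …, 0],
# …
# [0, 0, …, 0],
# ]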


(Remember, I am simplifying a lot, things are a ‘bit’ more complex)


Okay, now 2 questions arise… what are these pages and bytes I just talked about?



Page


A page is a fixed-length contiguous block of virtual memory.

To put it simply, modern operating systems use a memory management technique called “virtual memory” which, among other important things, lets programs use more memory than what is physically available. Virtual memory provides an abstraction layer between a program's logical memory and the physical RAM. You could say that virtual memory stores pointers to the computer’s physical memory.


In addition, a page is the smallest unit of data for memory management in an operating system that uses virtual memory. In our example, a page is a 4 kilobytes block (2^12 bytes, or 4096 bytes). This is the default page size on many operating systems, as it balances efficiency and addressing needs.
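To make fixed-size pages a bit more concrete, here is a minimal sketch of how an address can be split into a page number and an offset inside that page. The `locate` helper and the sample address are made up for this illustration; a real operating system does this with page tables and hardware support.

```ruby
PAGE_SIZE_IN_BYTES = 4096 # 2**12, the page size used in this article

# Hypothetical helper: splits an address into
# [page number, offset inside that page].
def locate(address)
  address.divmod(PAGE_SIZE_IN_BYTES)
end

page_number, offset = locate(10_000)
puts "address 10000 -> page #{page_number}, offset #{offset}"
# 10_000 = 2 * 4096 + 1808, so this is page 2, offset 1808
```

Because every page has the same size, finding the right page is a simple division, which is one reason fixed-size pages are convenient.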



Byte


A byte is a unit of digital information that consists of 8 bits. Each bit can represent a binary value of 0 or 1. Therefore, a byte can represent 2^8 (256) different values, ranging from 00000000 to 11111111 in binary. Binary is the language the computer understands. It comes from the physical nature of computers: a computer uses transistors to store data, and a transistor can only be in one of two states, 0 or 1.
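You can check these numbers directly in Ruby. This is just a small sketch, and the variable names are mine:

```ruby
bits_per_byte = 8
values_per_byte = 2**bits_per_byte # 256 different values

# Binary representation of the smallest and largest byte values
smallest = 0.to_s(2).rjust(bits_per_byte, "0")   # "00000000"
largest  = 255.to_s(2).rjust(bits_per_byte, "0") # "11111111"

# The letter "A" is stored as the byte 65, which is 01000001 in binary
a_as_bits = "A".bytes.first.to_s(2).rjust(bits_per_byte, "0")

puts values_per_byte
puts smallest, largest, a_as_bits
```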


Okay, now you probably have a better picture of what the RAM of your computer is, why it is important and how it interacts with your operating system.


It is crucial to know this because, in the next chapter about memory management in Ruby, you will instantly get why Ruby memory management is structured the way it is, now that you know what it interacts with (the RAM). You will also better understand why things are named the way they are in the Ruby VM (spoiler: you will read the term ‘page’ again).



Memory management in Ruby


As stated at the beginning of the article, the purpose of this article is not to get too deep into the technicalities. It is to give you a brief overview of how Ruby manages its memory and of the role of each component it uses.


Memory management in Ruby can be simplified as follows:


 

Just a quick explanation of this diagram: when you launch a Ruby program, the Ruby interpreter parses and compiles your Ruby code (which is, at first, some text in your code editor) into Ruby VM instructions.


These Ruby VM instructions are then executed by the Ruby VM, which is itself written in C and compiled to machine code. You can check the C implementation of each instruction in the Ruby GitHub repository here: https://github.com/ruby/ruby/blob/8bff7e996cf65159b4ed7b55c284de6651b7e637/insns.def
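If you are curious, you can peek at these VM instructions yourself. The sketch below assumes you are running CRuby (MRI), since `RubyVM::InstructionSequence` is specific to that implementation. The exact listing varies between Ruby versions, so take the output as an illustration only.

```ruby
# CRuby-specific: RubyVM is not available on other implementations such as JRuby.
source = "a = 1\nb = 2\na + b"

# Compile the source text into VM instructions, then ask for a readable listing
instructions = RubyVM::InstructionSequence.compile(source).disasm

puts instructions
```

Each line of the listing is one instruction for the Ruby VM.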


We will not go down this rabbit hole today; we will talk about this whole process in another article. Let’s focus on memory management here.



The Ruby Heap


The Ruby heap is part of the system heap (= the RAM used for dynamic allocation). It is allocated to the Ruby process by the Ruby VM at launch time, and its size depends on several factors that will not be discussed here. Just know that the VM allocates a base size to run Ruby code smoothly, and that it can expand the heap if necessary.


To expand its heap, Ruby performs malloc calls (malloc, short for "memory allocation", is a C function that allocates memory from the heap). These calls add some performance overhead. In addition, the allocated heap memory is generally not given back to the OS until the program exits.
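You can watch the heap grow with `GC.stat`. This is a rough sketch assuming CRuby: the `:heap_allocated_pages` and `:total_allocated_objects` keys are the ones exposed by recent CRuby versions, and the exact numbers will differ on your machine.

```ruby
pages_before   = GC.stat(:heap_allocated_pages)
objects_before = GC.stat(:total_allocated_objects)

# Keep a reference to every object so the GC cannot free them
retained = Array.new(200_000) { Object.new }

pages_after   = GC.stat(:heap_allocated_pages)
objects_after = GC.stat(:total_allocated_objects)

puts "objects allocated: #{objects_after - objects_before}"
puts "heap pages: #{pages_before} -> #{pages_after}"
```

On a typical run, the number of heap pages goes up because the existing pages do not have enough free slots for all the new objects.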


This might already light some things up in your mind like:

“That’s why I can reset the memory used by my Ruby program when I relaunch it!”


Or, updating this statement for a Rails app using a PaaS like Heroku or Scalingo:

“That’s why I can reset the memory of my web worker when I relaunch it!”


Or even:

“Hey, when I start different Rails apps, their web workers' base memory usage is different”

(assuming no transactions were made) -> if you define a lot of objects in your Rails app, Ruby, through its VM, will provision more memory at launch


I really wanted to explain to you how the Ruby heap is working in detail (RVALUE and stuff) to show you the magic of Ruby, but I’ll keep it really simple here, that will be another article, we will just talk about the big picture here.


Remember, we previously simplified the RAM as something like:


# The RAM could be represented as something like:
# => [
# [0, 0, …, 0],
# …
# [0, 0, …, 0],
# ]
# With the RAM being an array of pages, and each page being an array of bytes



Well, the Ruby heap is not really that different. Ruby 3.2 brought a major update to its structure that makes it a bit more complex, but until then it could be represented as something like:


# Reminder: Array.new(size) { ... } builds a new element for each index

ruby_page_size_in_bytes = 16384 # 16 kilobytes
ruby_slot_size_in_bytes = 40
number_of_slots_per_ruby_page = ruby_page_size_in_bytes / ruby_slot_size_in_bytes # 409

# Arbitrary value for the example, the real number depends on several factors
number_of_pages = 10

# Each 0 stands for one empty 40 bytes slot
ruby_heap = Array.new(number_of_pages) { Array.new(number_of_slots_per_ruby_page, 0) }

# The Ruby heap could be represented as something like:
# => [
# [40 bytes slot, 40 bytes slot, …, 40 bytes slot],
# …
# [40 bytes slot, 40 bytes slot, …, 40 bytes slot],
# ]
# With the Ruby heap being an array of ruby pages, and each page being an array of 40 bytes slots



The 16 kilobytes size for Ruby heap pages and the 40 bytes size for Ruby slots were designed for efficient memory management within the Ruby VM. It’s a technical choice that will not be developed here.
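If you want to compare these numbers with your own Ruby, CRuby exposes some of them through `GC::INTERNAL_CONSTANTS`. Treat this as a sketch: the hash is an internal detail, the key names I look up here are assumptions that vary between Ruby versions, and the values depend on your version and platform.

```ruby
# Internal, CRuby-specific constants. Fall back to an empty hash elsewhere.
constants = defined?(GC::INTERNAL_CONSTANTS) ? GC::INTERNAL_CONSTANTS : {}

# Key names vary between Ruby versions, so a missing key simply prints "n/a"
%i[HEAP_PAGE_SIZE RVALUE_SIZE BASE_SLOT_SIZE HEAP_PAGE_OBJ_LIMIT].each do |key|
  puts "#{key}: #{constants.fetch(key, 'n/a')}"
end
```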


And as mentioned before, it changed with Ruby 3.2, but the concept remains similar. It will be discussed in a future article.



Storing data in the Ruby heap


Okay, now that you have the structure of the Ruby heap in mind, simply imagine that the Ruby VM stores data in 1 slot if the data fits. If it does not fit, the VM allocates memory on the system heap with a malloc call and stores the pointer to it in 1 slot.


So, as the program runs, it fills these 40 bytes slots little by little. And that's it, you know the basics of how Ruby manages its memory! That wasn't so hard, was it?
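The difference between data that fits in a slot and data that needs extra memory can be observed with `ObjectSpace.memsize_of` from the `objspace` standard library. This is a sketch assuming CRuby: the Ruby documentation presents this method as a hint rather than an exact measure, and the sizes depend on your Ruby version.

```ruby
require "objspace"

short_string = "a" * 10    # small enough to live inside its slot
long_string  = "a" * 1_000 # too big: the characters go to malloc'ed memory

short_size = ObjectSpace.memsize_of(short_string)
long_size  = ObjectSpace.memsize_of(long_string)

puts "short string: #{short_size} bytes"
puts "long string:  #{long_size} bytes"
```

The long string reports much more memory than a single slot, because its size includes the buffer allocated outside of the Ruby heap.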


But do we really need to keep all these things in memory until the end of the program? The answer is no. Some data can be released once it is no longer useful, and that is the role of the Garbage Collector (GC)!



Garbage Collector (GC)


The purpose of the Garbage Collector is to… throw garbage out, as you guessed (and it is automatic, you do not have to take out the trash manually!). Garbage can be defined as “stored data that we do not need anymore”. Applied to Ruby, it roughly means “freeing heap slots that are storing objects that are no longer in use”.


As you can guess, the Garbage Collector is a key component of Ruby memory management. There are many approaches to collecting garbage in programming; the most common one is the tracing garbage collection strategy. It is used by Ruby, but also by JavaScript and Java.


This strategy works by identifying and collecting unreferenced objects by tracing reachable objects from the root of the program. The "Mark and Sweep" algorithm is a common implementation of tracing garbage collection.


To put it very simply, it works in 2 phases: marking all objects in the heap that are in use or referenced by objects in use, then sweeping, i.e. freeing, all objects that were not marked during the first phase. It is an efficient algorithm, but it can cause some latency during code execution because, in its basic form, the Garbage Collector has to pause the program while it works.
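The two phases can be sketched with a toy model in plain Ruby. This is not how the Ruby VM implements it (the real collector is written in C and works on heap slots); `ToyObject`, `mark` and `mark_and_sweep` are names I made up for the illustration.

```ruby
# Toy model: each "object" has a name, a list of references and a mark flag
ToyObject = Struct.new(:name, :references, :marked)

# Mark phase: flag an object and everything reachable from it
def mark(object)
  return if object.marked # already visited, this also protects against cycles

  object.marked = true
  object.references.each { |referenced| mark(referenced) }
end

def mark_and_sweep(heap, roots)
  heap.each { |object| object.marked = false } # reset marks from a previous run
  roots.each { |root| mark(root) }             # mark phase
  heap.select(&:marked)                        # sweep phase: unmarked objects are dropped
end

user    = ToyObject.new("user", [], false)
profile = ToyObject.new("profile", [], false)
orphan  = ToyObject.new("orphan", [], false)
cycle_a = ToyObject.new("cycle_a", [], false)
cycle_b = ToyObject.new("cycle_b", [cycle_a], false)

user.references << profile    # user -> profile
cycle_a.references << cycle_b # cycle_a <-> cycle_b, but nothing points to them

heap  = [user, profile, orphan, cycle_a, cycle_b]
roots = [user]

survivors = mark_and_sweep(heap, roots)
puts survivors.map(&:name).inspect # => ["user", "profile"]
```

Note that the two objects referencing each other are still collected: what matters is being reachable from a root, not being referenced by something.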


For your information, Ruby also uses other strategies in addition to the tracing strategy:

  • Generational garbage collection (since Ruby 2.1)
  • Incremental garbage collection (since Ruby 2.2)
  • “Heap compaction” garbage collection (since Ruby 2.7, cf. RubyConf 2019)
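Some of these strategies are visible from Ruby itself. The sketch below assumes CRuby, where `GC.stat` separates minor collections (young objects only, the generational part) from major ones (the whole heap). The counters will differ from one run to another.

```ruby
runs_before  = GC.count
major_before = GC.stat(:major_gc_count)

GC.start # asks for a full (major) collection

runs_after  = GC.count
major_after = GC.stat(:major_gc_count)

puts "GC runs: #{runs_before} -> #{runs_after}"
puts "major GCs: #{major_before} -> #{major_after}"
puts "minor GCs so far: #{GC.stat(:minor_gc_count)}"
puts "compaction available: #{GC.respond_to?(:compact)}"
```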


As you can see, garbage collectors are pretty complex processes and are not limited to one strategy. The Ruby GC might evolve again in the future.



Wrapping it up


Okay, I think that is enough for today, we covered a lot! 😅


We went through a lot of memory management concepts, the physical layer behind them (the RAM), and how they apply to Ruby. The big picture of Ruby memory management internals should now sound more familiar to you.

 

In the next article, we will delve a bit more into the Ruby heap and how it stores data. This will then allow us to understand why freezing a string is a good idea, or how to choose between a Struct, a custom Object or a Hash to store data.


