Import CSV data - Ruby vs PostgreSQL

Import CSV data into a Ruby on Rails app with plain Ruby and with PostgreSQL, and check which one is the better choice.

CSV - Plain and reliable format for data migrations

CSV is the most universal way of storing data in a text file. You can open a CSV file with almost any application: Excel, Numbers, Notepad. Parsing CSV in a programming language is a piece of cake too.

CSV files work great as bridges between different platforms and tools because of their universality and simplicity. That’s why being comfortable with the CSV format is a must-have skill for a software developer.

Okay, enough talking. Let’s look at how simple importing data from a CSV into a Ruby on Rails app is.

Import CSV data using Ruby

I’m a big fan of practical examples. And dogs. Let’s combine these hobbies and import dog data from CSV to a database using Ruby!

We’ll import dog breeds from this CSV file. At the moment it contains 361 rows of data plus a header row. Let’s assume that we want to import all columns except id, since ActiveRecord will deal with that one.
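For reference, the file’s layout looks roughly like this - the column names match the table we’ll create later, and the values below are just placeholders:

id,name,section,provisional,country,url,image,pdf
1,Affenpinscher,Section name,,Germany,https://example.com/breed,https://example.com/image.jpg,https://example.com/standard.pdf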

require "csv"

# Parse the file with headers and create a record per row, skipping the id column
csv_text = File.read(path_to_csv_file)
csv = CSV.parse(csv_text, headers: true)
csv.each do |row|
  DogBreed.create!(row.to_hash.except("id"))
end

Just a few lines of code and we’re done! Nice! But… that’s a bit boring. Yeah. Let’s do the same thing with PostgreSQL and its secret weapon - COPY!

Import CSV data using PostgreSQL

Ruby (on Rails) makes us happy and lazy at once. Using CSV.parse is a no-brainer. And it's great! But the general rule is:

Database engines deal better with data.

Okay, let’s get to the Postgres CSV import.

ActiveRecord::Base.connection.execute(
  <<-SQL
    -- 1. Temporary table shaped like the CSV file
    CREATE TEMPORARY TABLE temp_dog_breeds (
      id smallserial,
      name varchar,
      section varchar,
      provisional date,
      country varchar,
      url varchar,
      image varchar,
      pdf varchar
    );
    -- 2. Bulk-load the file into the temporary table
    COPY temp_dog_breeds FROM '#{path_to_csv_file}' WITH (FORMAT CSV, HEADER);
    -- 3. Move the data into dog_breeds, dropping id and adding timestamps
    INSERT INTO dog_breeds (
      name,
      section,
      provisional,
      country,
      url,
      image,
      pdf,
      updated_at,
      created_at
    )
    SELECT name, section, provisional, country, url, image, pdf, NOW(), NOW() FROM temp_dog_breeds;
    -- 4. Clean up
    DROP TABLE temp_dog_breeds;
  SQL
)

The code copies data from the CSV to the database using the COPY command. COPY itself is pretty simple; unfortunately, the data in the CSV is not compliant with the dog_breeds schema. The CSV file contains an id column that we don’t want to copy, and it lacks the updated_at and created_at timestamps.

That’s why I created a temporary table as an intermediate step. So the workflow looks like this:

  1. Create a temp table analogous to the CSV data

  2. Copy CSV data to the temp table

  3. Populate the desired table using the temp table data with additional time stamps and without the id

  4. Delete the temp table

Actually, we could cheat a bit and simplify the operation above. We could just remove the id column from the CSV and add updated_at and created_at columns 😇. In that case, dealing with the temporary table would be unnecessary.
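If we went that route, the preprocessing could look roughly like this (just a sketch - prepared_csv_path is a hypothetical output path, and the timestamps are hardcoded on purpose):

require "csv"
require "time"

now = Time.now.utc.iso8601

CSV.open(prepared_csv_path, "w") do |out|
  rows = CSV.read(path_to_csv_file, headers: true)
  # New header: the original columns minus id, plus the two timestamp columns
  out << (rows.headers - ["id"]) + ["updated_at", "created_at"]
  rows.each do |row|
    out << row.to_hash.except("id").values + [now, now]
  end
end

Then a single COPY straight into dog_breeds would do the job, no temporary table needed.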

But hardcoding timestamps is kind of a hack. And we had a nice occasion to use a temporary table anyway.
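One more caveat: COPY temp_dog_breeds FROM '<file>' reads the file from the database server’s filesystem, so it only works when the file lives on that machine and it usually requires superuser rights (or the pg_read_server_files role). If your Rails app and Postgres run on different hosts, you can stream the file over the connection with COPY ... FROM STDIN instead. A minimal sketch using the pg gem’s copy_data, assuming the same temp_dog_breeds table has already been created:

raw = ActiveRecord::Base.connection.raw_connection

# Stream the CSV over the connection instead of reading it server-side
raw.copy_data("COPY temp_dog_breeds FROM STDIN WITH (FORMAT CSV, HEADER)") do
  File.foreach(path_to_csv_file) do |line|
    raw.put_copy_data(line)
  end
end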

Import CSV - Ruby vs SQL

Is it worth crafting a SQL query for importing CSV data instead of using Ruby? It’s more complex. But do you remember that databases deal with data better than programming languages do?

I benchmarked this and here is the result.

SQL:    61.990  (±24.2%) i/s -    581.000  in  10.008736s
# 10-second runs with a 2-second warmup

Comparison:
            SQL:       62.0 i/s
           ruby:        0.6 i/s - 112.11x  slower

Postgres was about 112 times faster!

Nice! I didn’t check the RAM usage, but you can expect that it was much lower for SQL as well 🚀
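The output format above looks like benchmark-ips. If you want to reproduce a comparison like this yourself, the setup could look roughly like so (a sketch - import_with_ruby and import_with_sql are hypothetical wrappers around the two snippets above):

require "benchmark/ips"

Benchmark.ips do |x|
  x.warmup = 2   # warm up for 2 seconds
  x.time = 10    # measure for 10 seconds

  x.report("ruby") { import_with_ruby }
  x.report("SQL")  { import_with_sql }

  # Prints the relative comparison, e.g. "ruby: ... x slower"
  x.compare!
end

In a real run you’d also want to truncate dog_breeds between iterations so the table doesn’t keep growing.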

Wrap-up - tougher is faster, but… tougher

There’s no doubt that Postgres COPY smashes Ruby in terms of CSV import performance. You should always import CSV data with SQL then! Well... no.

For simple cases like the one with dog breeds, I’d go with Ruby. It’s a small dataset, and we can expect that it doesn’t change too frequently over time. Be pragmatic.

When it comes to big datasets and frequent CSV imports, Postgres is much better. But remember that it costs you complexity and time - not only for crafting the query but also for maintaining it.