CSV is one of those formats that nobody loves but everyone uses. It's been around since the 1970s, it's still the default export format for every spreadsheet app on the planet, and it's almost certainly sitting somewhere in your data pipeline right now. The format looks trivially simple — just values separated by commas — but parsing it correctly is surprisingly easy to get wrong. Let's go through how it actually works, where it gets messy, and how to deal with it in Python and JavaScript without shooting yourself in the foot.
What CSV Actually Is
CSV stands for Comma-Separated Values. A CSV file is plain text — no binary encoding, no metadata, no structure beyond rows and columns. Each line is a record, each value within a line is separated by a delimiter (usually a comma), and the first row is conventionally a header row with column names. Here's what a real CSV looks like:
```
order_id,customer,product,quantity,unit_price,shipped
1001,Alice Nguyen,USB-C Hub,2,34.99,true
1002,Bob Martinez,Mechanical Keyboard,1,129.00,false
1003,Alice Nguyen,HDMI Cable,3,12.50,true
1004,Carol Smith,Webcam 1080p,1,79.95,false
```

That's it — six columns, four data rows, and a header. No angle brackets, no curly braces, no quotes needed (yet). This simplicity is exactly why CSV persists: almost any tool on earth can open, read, and produce it.
Excel, Google Sheets, PostgreSQL's COPY command, pandas, R's read.csv() — all of them read and write CSV out of the box. The tradeoff is that the format has almost no structure: no types, no nesting, no schema. Every value is a string until you decide otherwise.
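To make that concrete, here's a minimal sketch using Python's csv module with an in-memory sample row: every field comes back as a str, and any numeric or boolean conversion is your job.

```python
import csv
import io

# Parse one record from an in-memory sample — every field is a string
row = next(csv.reader(io.StringIO("1001,Alice Nguyen,2,34.99,true")))
print(row)  # ['1001', 'Alice Nguyen', '2', '34.99', 'true']

# Type conversion is entirely up to you
quantity = int(row[2])
price = float(row[3])
shipped = row[4] == 'true'
```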
RFC 4180 — The Standard That Isn't Really Enforced
There is a specification: RFC 4180, published in 2005. It defines things like CRLF line endings, double-quote escaping, and how to handle embedded commas. But here's the thing — RFC 4180 is informational, not a standard. It's describing what was already common practice, not legislating it. Nobody is required to follow it, and in practice, almost nobody follows it exactly.
The result is CSV dialect chaos. You've got files that use LF instead of CRLF, files where the first row may or may not be a header, files with a trailing newline and files without, files with a UTF-8 BOM prepended by Excel. The Wikipedia article on CSV has a solid rundown of all the things that vary in practice. The safest approach: never assume anything about a CSV file you didn't produce yourself, and always use a proper parser rather than naively splitting on commas.
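In Python, one way to avoid assuming a dialect is csv.Sniffer, which guesses the delimiter and quoting style from a sample of the file. It's a heuristic, so treat the result as a guess to verify, not a guarantee; a quick sketch:

```python
import csv
import io

# A sample in an unknown dialect (here: semicolon-delimited)
sample = "order_id;customer;product\n1001;Alice Nguyen;USB-C Hub\n1002;Bob Martinez;Mechanical Keyboard\n"

# Restricting the candidate delimiters makes the guess more reliable
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
print(dialect.delimiter)  # ';'

rows = list(csv.reader(io.StringIO(sample), dialect=dialect))
print(rows[1])  # ['1001', 'Alice Nguyen', 'USB-C Hub']
```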
Quoting Rules: When Commas Appear Inside Values
This is where "just split on commas" falls apart. What happens when a value contains a comma? Or a newline? Or a double-quote? RFC 4180 — and most real-world parsers — handle it by wrapping the field in double quotes. Here's the full set of rules:
- If a field contains a comma, a newline, or a double-quote, wrap the entire field in double quotes
- If a field contains a double-quote, escape it by doubling it ("")
- Quoted fields can span multiple lines — the newline becomes part of the value
- Whitespace inside quotes is significant and must be preserved
```
order_id,customer,notes,unit_price
1005,David Lee,"Wants gift wrapping, express shipping",45.00
1006,Emma Brown,"Said: ""please handle with care""",89.99
1007,Frank Wu,"Address:
123 Main St
Apt 4B",15.50
```

In that example: David's notes field contains a comma, so it's quoted. Emma's note contains double-quotes, so they're doubled up inside the outer quotes. Frank's address spans multiple lines — the line breaks are part of the value. A naive line.split(',') in any language will completely mangle all three of these. This is why you need a real parser.
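To see a real parser earn its keep, feed those tricky rows to Python's csv module (a quick sketch using io.StringIO in place of a file):

```python
import csv
import io

# Escaped double-quotes inside a quoted field
emma = next(csv.reader(io.StringIO('1006,Emma Brown,"Said: ""please handle with care""",89.99\n')))
print(emma[2])  # Said: "please handle with care"

# A quoted field spanning physical lines is still one logical record
frank = next(csv.reader(io.StringIO('1007,Frank Wu,"Address:\n123 Main St\nApt 4B",15.50\n')))
print(len(frank))  # 4 fields, not one row per physical line
print(frank[2])    # the address, embedded newlines intact
```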
Delimiter Variants: Tabs, Semicolons, Pipes
Despite the name, CSV doesn't have to use commas. Several common variants use different delimiters, and you'll encounter all of them in the wild:
- TSV (Tab-Separated Values) — uses \t as the delimiter. Common in bioinformatics (BLAST output, VCF files), database exports, and anywhere values themselves frequently contain commas
- Semicolon-separated — the default in Excel for locales where the comma is the decimal separator (Germany, France, most of the EU). If you've ever opened a CSV from a European colleague and gotten one giant column, this is why
- Pipe-separated — uses |. Common in legacy banking and insurance data exports where tabs might be stripped by mainframe systems
- Fixed-width — not technically CSV, but often lumped in the same category. Columns are padded to fixed widths rather than delimited
Most CSV parsers let you specify the delimiter explicitly. When you're writing a CSV that others will consume, document your delimiter. When you're reading one you didn't produce, check the first few lines before assuming. The CSV Formatter can help you inspect and reformat files with non-standard delimiters.
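In Python, for instance, the delimiter is just a keyword argument. A small sketch with an in-memory semicolon-delimited sample (note the decimal comma in 34,99, which stays inside its field):

```python
import csv
import io

semicolon_text = "order_id;customer;unit_price\n1001;Alice Nguyen;34,99\n"

# Read with an explicit delimiter; the comma in "34,99" is just data
rows = list(csv.reader(io.StringIO(semicolon_text), delimiter=';'))
print(rows[1])  # ['1001', 'Alice Nguyen', '34,99']
```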
Parsing CSV in Python
Python's standard library includes the csv module, and it's genuinely good. The two classes you'll use most are csv.reader for row-as-list access and csv.DictReader for row-as-dict access (which is almost always what you want).
```python
import csv

# csv.reader — each row is a list of strings
with open('orders.csv', 'r', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # consume the header row
    for row in reader:
        order_id, customer, product, quantity, price, shipped = row
        print(f"Order {order_id}: {quantity}x {product} for {customer}")

# csv.DictReader — each row is a dict keyed by the header
# (a plain dict since Python 3.8; an OrderedDict before that)
with open('orders.csv', 'r', encoding='utf-8', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        if row['shipped'] == 'false':
            print(f"Pending: {row['order_id']} — {row['customer']}")
```

Two things to always include: encoding='utf-8' (or whatever the file actually uses), and newline=''. That second one is critical — the csv module does its own newline handling and needs to see the line endings untranslated. Without newline='', you can get extra blank rows on Windows.
Writing is just as straightforward with csv.writer:
```python
import csv

orders = [
    {'order_id': 1008, 'customer': 'Grace Kim', 'product': 'Laptop Stand', 'quantity': 1, 'unit_price': 49.99, 'shipped': False},
    {'order_id': 1009, 'customer': 'Henry Park', 'product': 'USB-C Hub', 'quantity': 2, 'unit_price': 34.99, 'shipped': True},
]

with open('new_orders.csv', 'w', encoding='utf-8', newline='') as f:
    fieldnames = ['order_id', 'customer', 'product', 'quantity', 'unit_price', 'shipped']
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(orders)

# To use a different delimiter (e.g. tab-separated)
with open('new_orders.tsv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(['order_id', 'customer', 'product'])
    writer.writerow([1008, 'Grace Kim', 'Laptop Stand'])
```

The csv module takes care of all the quoting automatically — if a field contains a comma or a newline, it wraps it in double-quotes without you having to think about it. That's the whole point of using a library.
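If you want more than the default minimal quoting, the writer's quoting parameter accepts constants like csv.QUOTE_ALL. A quick sketch writing to an in-memory buffer:

```python
import csv
import io

buf = io.StringIO()
# QUOTE_ALL wraps every field in quotes, not just the risky ones
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(['1008', 'Grace Kim', 'Laptop Stand'])
print(buf.getvalue())  # "1008","Grace Kim","Laptop Stand"
```

This can be handy when a downstream consumer treats quoted and unquoted fields differently, at the cost of slightly larger files.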
Parsing CSV in JavaScript / Node.js
JavaScript doesn't have a built-in CSV parser. The temptation is to do this:
```javascript
// ❌ Don't do this — breaks immediately on quoted fields
const rows = csvText.split('\n').map(line => line.split(','));
```

That fails the moment any value contains a comma, a newline inside quotes, or an escaped double-quote. For anything real, use a library. PapaParse is the go-to — it's fast, handles all the edge cases, works in both the browser and Node.js, and supports streaming for large files.
```javascript
import Papa from 'papaparse';
import fs from 'fs';

// Parse a CSV string
const csvText = fs.readFileSync('orders.csv', 'utf8');
const result = Papa.parse(csvText, {
  header: true,         // first row becomes object keys
  skipEmptyLines: true, // ignore blank lines at end of file
  dynamicTyping: true,  // converts "true"/"false" to booleans, numbers to numbers
});

console.log(result.data);
// [
//   { order_id: 1001, customer: 'Alice Nguyen', product: 'USB-C Hub', quantity: 2, unit_price: 34.99, shipped: true },
//   { order_id: 1002, customer: 'Bob Martinez', product: 'Mechanical Keyboard', quantity: 1, unit_price: 129, shipped: false },
//   ...
// ]

// Check for parse errors
if (result.errors.length > 0) {
  console.error('Parse errors:', result.errors);
}

// Generate CSV from an array of objects
const orders = [
  { order_id: 1008, customer: 'Grace Kim', product: 'Laptop Stand', quantity: 1, unit_price: 49.99 },
  { order_id: 1009, customer: 'Henry Park', product: 'USB-C Hub', quantity: 2, unit_price: 34.99 },
];
const csvOutput = Papa.unparse(orders);
fs.writeFileSync('export.csv', csvOutput, 'utf8');
```

The dynamicTyping option is handy but worth knowing about — it converts things like "34.99" to a number automatically. That's usually what you want, but it can surprise you if a field like order_id happens to be a number in the CSV but you wanted it as a string. Turn it off if you need strict string output.
For quick conversions between CSV and other formats, the CSV to JSON tool handles most common cases in the browser without any code — useful for one-off data transformations. There's also CSV to XML if you need to feed the data to an XML-based system.
Common Gotchas You'll Definitely Hit
Even with a good parser, CSV has a few well-known landmines. Here are the ones that show up most often:
- BOM bytes from Excel. When Excel exports a CSV as "UTF-8", it often prepends a UTF-8 BOM (EF BB BF in hex, or \ufeff as a character). This makes the first column header look like \ufefforder_id (often rendered as ï»¿order_id) instead of order_id. In Python, open the file with encoding='utf-8-sig' instead of utf-8 to strip it automatically. PapaParse handles it transparently.
- CRLF vs LF line endings. RFC 4180 specifies CRLF (\r\n), but Unix tools produce LF (\n) and old Mac files use CR (\r) alone. This is why Python's csv module needs newline='' — it handles all three internally. If you're reading raw bytes and splitting manually, you need to strip \r explicitly.
- Encoding issues. CSV has no way to declare its own encoding — unlike HTML's <meta charset> or XML's <?xml encoding="..."?>. Excel often saves files in Windows-1252 (a.k.a. CP1252) rather than UTF-8, which mangles accented characters. If you see characters like Ã© instead of é, you've got a UTF-8 file being decoded as Latin-1, or vice versa. Always establish encoding out of band with whoever produces the file.
- Numbers that look like dates. Excel silently converts values like 1-2 or 03/04 to dates when opening or saving. If you're exporting product codes or version numbers, prefix them with a single quote in Excel ('1-2) to prevent this — or tell whoever produces the file to do so.
- Trailing commas. Some exporters emit a trailing comma at the end of every line. A parser will faithfully hand you an empty last field for it, and a naive split creates an extra empty string element, so either way plan to drop the phantom column after parsing.
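The BOM gotcha is easy to demonstrate in Python: the same bytes decode differently under utf-8 and utf-8-sig.

```python
# Bytes as Excel might write them: UTF-8 BOM, then the header row
raw = b'\xef\xbb\xbforder_id,customer\n1001,Alice Nguyen\n'

plain = raw.decode('utf-8')    # BOM survives as '\ufeff' glued to the header
sig = raw.decode('utf-8-sig')  # BOM stripped automatically

print(plain.startswith('\ufeff'))  # True — first header would be '\ufefforder_id'
print(sig.split(',', 1)[0])        # order_id
```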
If you're dealing with a file that looks off, the CSV Validator can quickly tell you whether the file is well-formed and flag encoding or structural issues before you try to process it.
CSV vs JSON vs Excel — When to Use What
These three formats overlap a lot in practice, but each has a clear sweet spot:
- Use CSV when you're moving flat, tabular data between systems — database exports, analytics pipelines, spreadsheet imports, bulk data loading. It's universally supported, tiny in size, and trivially diff-able in git. The constraint: it's flat. No nesting, no types, no relationships.
- Use JSON when data is hierarchical or schema matters. An order with multiple line items, a config file with nested objects, an API response — these are naturally JSON. CSV would force you to denormalise the data or invent your own nesting convention. The JSON spec is clean and unambiguous; the format preserves types (numbers, booleans, null, arrays, objects).
- Use Excel (.xlsx) when the output is for humans, not machines. Formatting, formulas, multiple sheets, charts — if a business user is the final consumer, Excel is often the right call. Just never use it as an interchange format between systems. The OOXML spec is enormous and the format is fragile across versions.
There's a practical heuristic: if you'd naturally view the data in a spreadsheet, CSV is probably fine. If you'd view it in a tree or a code editor, use JSON. If you're sending it to a business stakeholder who will filter and sort it, use Excel.
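The CSV-to-JSON direction is only a few lines in Python, since csv.DictReader already yields the one-object-per-row shape JSON wants. A sketch (note that every value stays a string unless you convert it yourself):

```python
import csv
import io
import json

csv_text = "order_id,customer\n1001,Alice Nguyen\n1002,Bob Martinez\n"

# Each row becomes a dict keyed by the header, then serializes directly
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(json.dumps(rows, indent=2))
```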
Wrapping Up
CSV is simple by design and deceptively tricky in practice. The format has no enforced standard,
ships in multiple delimiter variants, and relies on quoting rules that break every manual parsing attempt.
The fix is always the same: use a real parser (Python's built-in csv module, or PapaParse
in JavaScript), always specify the encoding explicitly, and watch out for BOM bytes from Excel. Once you
have a reliable parser in place, CSV is actually a great format — fast to produce, easy to inspect in any
text editor, and supported everywhere. For day-to-day work with CSV files, the
CSV to JSON, CSV Formatter, and
CSV Validator tools handle the common operations without writing a line of code.