Kuba

GDPR for Developers: What the Regulation Actually Means in Code

Most articles about GDPR are written by lawyers for lawyers. This one is written by a developer for developers — with real code, real mistakes, and real consequences.


I've built several B2B SaaS products in Europe. Billing platforms, webhook inspectors, hotel management systems. Every single one had to deal with GDPR in some form. And every single time I talked to other developers about it, the conversation went one of two ways:

Either "we just added a cookie banner and called it done" — or pure panic, because someone realized they'd been copying production data into staging for two years.

Neither response is correct. GDPR is not a checkbox. It's a set of engineering constraints that affects your schema design, your logging, your test data, your third-party integrations, and your deletion logic. If you treat it as a legal problem, you'll handle it wrong. It's an engineering problem.

Let me walk you through what it actually means in code.


What GDPR Actually Cares About

Before writing any code, you need to understand what the regulation is trying to enforce. Strip away the legal language and you get six core principles:

  1. Lawfulness — you need a legal basis to process personal data (consent, contract, legitimate interest, etc.)
  2. Purpose limitation — you can only use data for the reason you collected it
  3. Data minimisation — collect only what you actually need
  4. Accuracy — keep data correct and up to date
  5. Storage limitation — don't keep data longer than necessary
  6. Integrity and confidentiality — protect the data technically

As a developer, principles 3, 5, and 6 hit your code directly. The rest are mostly product and legal decisions — but you're the one implementing them.


Mistake #1: Your Schema Collects Too Much

The most common GDPR violation I see in codebases is schema bloat. Developers add columns "because they might be useful later." Under GDPR, that's a problem — you're collecting data without a documented purpose.

Look at this typical user table:

CREATE TABLE users (
  id UUID PRIMARY KEY,
  email VARCHAR NOT NULL,
  first_name VARCHAR,
  last_name VARCHAR,
  phone VARCHAR,
  date_of_birth DATE,
  gender VARCHAR,
  address TEXT,
  city VARCHAR,
  country VARCHAR,
  ip_address VARCHAR,
  user_agent TEXT,
  referral_source VARCHAR,
  linkedin_url VARCHAR,
  twitter_handle VARCHAR,
  annual_income INTEGER,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);

Half of those columns are probably unused. annual_income? linkedin_url? If you're not actively using these fields in your application logic, you have no legal basis to store them.

The fix: for every column, document the purpose. If you can't write one sentence explaining why you need it, drop it. This is called a Record of Processing Activities (ROPA) — GDPR Article 30 requires you to maintain one.

A practical way to do this in a codebase is to annotate your entity definitions:

export type UserEntityProps = {
  id: string;
  email: string;           // purpose: authentication, transactional emails
  firstName: string;       // purpose: personalization, invoicing
  lastName: string;        // purpose: invoicing
  createdAt: Date;
  updatedAt: Date;
  // REMOVED: phone — not used in any feature
  // REMOVED: dateOfBirth — collected at signup but never used
  // REMOVED: gender — collected "for analytics" but analytics never built
};

This looks obvious written down. But I've seen production databases with columns that hadn't been read by any query in over a year.


Mistake #2: Your Staging Database Has Real User Data

This is the one that gets companies fined.

Your production database has real personal data — emails, names, payment info, addresses. Your staging environment needs realistic data to test against. So you dump production and restore it to staging. Quick, easy, and a GDPR violation.

Staging environments typically have:

  • Weaker access controls
  • More people with access (junior devs, contractors, QA)
  • Logs that go to less secure places
  • Less monitoring

The moment real PII lands in your staging database, you've created an unlawful processing activity. You're using data beyond its original purpose, storing it in an environment that wasn't disclosed to users, and likely sharing it with people who have no business seeing it.

The fix has two parts.

First, never let production PII leave production without transformation. Before any data moves to a non-production environment, it must be anonymized or pseudonymized.

Second, automate this. Manual anonymization is error-prone. Someone forgets a table. Someone adds a new column with PII and nobody updates the anonymization script. Someone copies the DB "just this once" because they need to debug something urgently.

The correct solution is a tool that:

  • Understands your schema
  • Knows which columns contain PII (by name and by value pattern)
  • Applies masking rules automatically and consistently
  • Runs in your CI/CD pipeline so staging is refreshed automatically

I'll come back to this in detail later. First, let's talk about deletion.


Mistake #3: You Don't Actually Delete Data

GDPR Article 17 gives users the right to erasure — commonly called "the right to be forgotten." When a user requests deletion, you must delete their personal data.

Most applications implement this wrong. They set a deleted_at timestamp and call it done:

async deleteUser(userId: string): Promise<void> {
  await this.userRepository.update(userId, {
    deletedAt: new Date(),
  });
}

This is not deletion. The data is still there. You've just hidden it from your UI. The user's email, name, and everything else is sitting in your database, fully readable by anyone with database access.

True erasure means different things for different data types:

Structured PII in your own database — overwrite with anonymized values, don't just soft-delete:

async eraseUser(userId: string): Promise<void> {
  await this.userRepository.update(userId, {
    email: `deleted-${userId}@erased.invalid`,
    firstName: 'Deleted',
    lastName: 'User',
    phone: null,
    address: null,
    deletedAt: new Date(),
    erasedAt: new Date(),
  });
}

Notice: keep the row with a tombstone. You need to preserve the id for referential integrity and the erasedAt timestamp to prove compliance. You just remove all the personal data from it.

Backups — this is the hard one. Your production backup from last night contains the user's real data. GDPR allows a reasonable retention period for backups (typically 30-90 days, document your policy). After that window, the backup must be deleted too. You can't restore a 2-year-old backup and bring deleted users back to life.
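Whatever your backup tooling, the selection logic is simple enough to enforce in code. A minimal sketch, assuming a hypothetical `BackupInfo` shape and a 90-day window; wire it to whatever metadata your backup system actually exposes:

```typescript
// Hypothetical backup metadata; the real shape depends on your backup tooling.
type BackupInfo = { id: string; createdAt: Date };

const RETENTION_DAYS = 90; // document this window in your retention policy

// Returns the backups that have aged past the retention window and must be destroyed.
function backupsPastRetention(backups: BackupInfo[], now: Date): BackupInfo[] {
  const cutoff = new Date(now.getTime() - RETENTION_DAYS * 24 * 60 * 60 * 1000);
  return backups.filter(b => b.createdAt < cutoff);
}
```

Run it from the same nightly job as the rest of your retention cleanup, and audit-log each destroyed backup.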

Third-party services — if you sent the user's data to Mailchimp, Mixpanel, Intercom, or any other SaaS, you need to trigger deletion there too. Build a checklist, ideally automated:

async handleUserErasureRequest(userId: string): Promise<void> {
  const user = await this.userRepository.findById(userId);

  // 1. Erase in your own database
  await this.eraseUser(userId);

  // 2. Remove from email marketing
  await this.mailchimpService.deleteContact(user.email);

  // 3. Remove from analytics
  await this.mixpanelService.deleteUser(user.id);

  // 4. Remove from support tool
  await this.intercomService.deleteContact(user.email);

  // 5. Log the erasure for compliance
  await this.auditLogService.log({
    event: 'USER_ERASURE_COMPLETED',
    userId,
    thirdPartyServices: ['mailchimp', 'mixpanel', 'intercom'],
    completedAt: new Date(),
  });
}

That audit log entry is not optional — it's your proof that you complied with the request.
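One gap in the sequential awaits above: if the Mixpanel call throws, the Intercom deletion and the audit log never run. A hedged sketch of a more resilient pattern (the `ErasureStep` type is illustrative, not from any library):

```typescript
// Each third-party deletion wrapped as an independent step.
type ErasureStep = { service: string; run: () => Promise<void> };

// Attempt every step even if one fails, and report failures for retry.
async function runErasureSteps(steps: ErasureStep[]): Promise<string[]> {
  const failed: string[] = [];
  for (const step of steps) {
    try {
      await step.run();
    } catch {
      failed.push(step.service); // queue these for a retry job
    }
  }
  return failed;
}
```

The caller records the failed services in the audit log and requeues them, so a flaky third-party API can't silently turn a completed erasure into a partial one.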


Mistake #4: No Audit Trail

GDPR requires you to be able to demonstrate compliance. "We handle data properly" is not enough — you need to prove it.

This means audit logging for every significant operation involving personal data:

  • Who accessed the data
  • What they did with it
  • When
  • From where (IP, system)

Most applications log errors. Few log data access. Build audit logging as a first-class concern, not an afterthought.

In a NestJS application, a clean approach is using a request context to capture the actor once per request and propagate it through the service layer:

// Captured once in the auth guard
@Injectable()
export class AuthGuard implements CanActivate {
  constructor(private readonly cls: ClsService) {}

  canActivate(context: ExecutionContext): boolean {
    const request = context.switchToHttp().getRequest();

    this.cls.set('userId', request.user.id);
    this.cls.set('ip', request.ip);
    this.cls.set('userAgent', request.headers['user-agent']);

    return true;
  }
}

// Used anywhere in the service layer
@Injectable()
export class AuditLogService {
  constructor(private readonly cls: ClsService) {}

  async log(event: string, metadata?: Record<string, unknown>): Promise<void> {
    await this.auditLogRepository.create({
      event,
      userId: this.cls.get('userId'),
      ip: this.cls.get('ip'),
      metadata,
      createdAt: new Date(),
    });
  }
}

Now any service can call auditLogService.log('USER_PROFILE_VIEWED') without manually threading the request context through every function call.

What events should you log? At minimum:

  • User login and logout
  • Password changes
  • Personal data exports (when a user downloads their data)
  • Erasure requests and completions
  • Admin accessing user data
  • Bulk data operations

Mistake #5: Storing Sensitive Data in Logs

Your application logs are not GDPR-compliant by default. Every time you log a user object for debugging, you're potentially writing PII to log files that:

  • Get shipped to Datadog, Papertrail, or another SaaS with their own retention policies
  • Are accessible to more people than your database
  • May be retained indefinitely

// BAD — logs full user object including PII
this.logger.debug('Processing order', { user, order });

// GOOD — log only non-PII identifiers
this.logger.debug('Processing order', { 
  userId: user.id, 
  orderId: order.id,
  plan: user.plan 
});

This sounds minor but it's a systematic issue. One console.log(req.body) in the wrong place can write email addresses, payment info, or passwords into your log aggregator.

The rule: never log objects that contain PII. Log IDs and non-sensitive fields only. If you need to debug a specific user issue, do it through controlled database queries with proper access controls — not by making their data appear in logs.
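If you want a safety net on top of that rule, a redaction helper can strip known PII keys before anything reaches the logger. A minimal sketch (the key list is an assumption; it's shallow and won't catch nested objects or unexpected field names, which is why logging IDs only remains the safer default):

```typescript
// Field names we treat as PII; extend this to match your schema.
const PII_KEYS = new Set(['email', 'firstName', 'lastName', 'phone', 'address', 'password']);

// Shallow redaction: replace known PII fields before the object hits the logger.
function redact(obj: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    out[key] = PII_KEYS.has(key) ? '[REDACTED]' : value;
  }
  return out;
}
```

Hook this into your logger's serializer so the protection is automatic rather than per-call-site.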


Mistake #6: Ignoring Data Retention

You don't need to keep data forever. Under GDPR's storage limitation principle, you should define how long you keep each type of data and actually enforce it.

Common retention periods (document these in your privacy policy):

| Data type | Typical retention | Reason |
| --- | --- | --- |
| Active user account | Until erasure request | Contract |
| Inactive account | 2 years after last login | Legitimate interest |
| Payment records | 7 years | Tax law (overrides GDPR) |
| Audit logs | 1-3 years | Compliance |
| Marketing emails | Until unsubscribe | Consent |
| Server logs | 30-90 days | Security |

Enforce this in code, not in policy documents. A nightly job that runs retention cleanup:

@Cron('0 2 * * *') // 2am daily
async runRetentionCleanup(): Promise<void> {
  const twoYearsAgo = subYears(new Date(), 2); // subYears from date-fns

  // Find inactive accounts
  const inactiveUsers = await this.userRepository.find({
    where: {
      lastLoginAt: LessThan(twoYearsAgo),
      erasedAt: null,
    },
  });

  for (const user of inactiveUsers) {
    // Don't delete — anonymize (preserve referential integrity)
    await this.eraseUser(user.id);
    await this.auditLogService.log('USER_AUTO_ERASED_RETENTION_POLICY', {
      userId: user.id,
      reason: 'inactive_2_years',
    });
  }
}

Handling Test Data Properly

Back to the staging problem. Here's what a proper test data pipeline looks like.

You need a process that runs automatically and produces anonymized data your team can actually use. The process has two phases: schema analysis and data transformation.

Schema analysis — inspect your database schema and identify PII columns. Detection works at two levels: column name patterns and value patterns.

Column name heuristics catch obvious cases:

const PII_COLUMN_PATTERNS = [
  /^email$/i,
  /^(first|last|full|display)_?name$/i,
  /^phone(_number)?$/i,
  /^(ip|ip_address|ipaddr)$/i,
  /^(address|street|city|zip|postal)(_.*)?$/i,
  /^(dob|date_of_birth|birth_date)$/i,
  /^(ssn|pesel|national_id)$/i,
  /^(iban|account_number)$/i,
  /^password(_hash)?$/i,
  /^(api_key|secret|token)$/i,
];

function detectPiiByColumnName(columnName: string): boolean {
  return PII_COLUMN_PATTERNS.some(pattern => pattern.test(columnName));
}

Value pattern sampling catches columns with non-obvious names:

const EMAIL_REGEX = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
const PHONE_REGEX = /^\+?[\d\s\-().]{7,}$/;
const IP_REGEX = /^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/;

// The detector's possible classifications (not defined elsewhere in the snippet)
type PiiType = 'email' | 'phone' | 'ip_address';

function detectPiiByValues(sampleValues: string[]): PiiType | null {
  const nonNull = sampleValues.filter(Boolean);
  if (nonNull.length === 0) return null;

  const emailMatches = nonNull.filter(v => EMAIL_REGEX.test(v)).length;
  if (emailMatches / nonNull.length > 0.8) return 'email';

  const phoneMatches = nonNull.filter(v => PHONE_REGEX.test(v)).length;
  if (phoneMatches / nonNull.length > 0.8) return 'phone';

  const ipMatches = nonNull.filter(v => IP_REGEX.test(v)).length;
  if (ipMatches / nonNull.length > 0.8) return 'ip_address';

  return null;
}

Data transformation — apply the right strategy to each column based on its type and your configuration:

type MaskingStrategy = 
  | 'fake'      // replace with realistic fake value
  | 'mask'      // partially obscure (j**@g*****.com)
  | 'scramble'  // shuffle values between rows
  | 'preserve'  // keep original
  | 'nullify'   // set to NULL
  | 'fixed';    // always set to specific value

function applyStrategy(
  value: string | null,
  strategy: MaskingStrategy,
  fakerType?: string,
  fixedValue?: string,
): string | null {
  switch (strategy) {
    case 'fake':
      return generateFakeValue(fakerType ?? 'word');
    case 'mask':
      return maskValue(value);
    case 'preserve':
      return value;
    case 'nullify':
      return null;
    case 'fixed':
      return fixedValue ?? null;
    case 'scramble':
      // scrambling needs the whole column at once, so it's applied at the
      // batch level (see scrambleColumn below); per-row, pass the value through
      return value;
  }
}

function maskValue(value: string | null): string | null {
  if (!value) return null;
  if (EMAIL_REGEX.test(value)) {
    const [local, domain] = value.split('@');
    return `${local[0]}**@${domain[0]}*****.${domain.split('.').pop()}`;
  }
  // Generic masking for other types; fully mask values too short to keep edges
  if (value.length < 3) return '*'.repeat(value.length);
  return value[0] + '*'.repeat(value.length - 2) + value[value.length - 1];
}

The scramble strategy deserves special attention — it preserves statistical distribution while destroying the link between a value and a specific person. For salary data, for example, you want to keep the realistic range of values (€30k-€150k) but not associate €150k with the actual person who earns it:

function scrambleColumn(values: (string | null)[]): (string | null)[] {
  const nonNull = values.filter((v): v is string => v !== null);
  // Fisher-Yates shuffle
  for (let i = nonNull.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [nonNull[i], nonNull[j]] = [nonNull[j], nonNull[i]];
  }
  // Put back in positions where original wasn't null
  let idx = 0;
  return values.map(v => v === null ? null : nonNull[idx++]);
}

The critical constraint in all of this: foreign key integrity must be preserved. If orders.user_id references users.id, and you're subsetting 10% of your users, you can only include orders that belong to those users. This requires building a dependency graph of your schema and processing tables in topological order.
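That topological ordering can be sketched as a depth-first sort over the FK graph. A minimal version, assuming a hypothetical `deps` map (a real tool would build it from `information_schema` foreign key constraints):

```typescript
// deps maps each table to the tables its foreign keys point at.
// Returns tables ordered so every table appears after its dependencies.
function topoOrder(deps: Record<string, string[]>): string[] {
  const order: string[] = [];
  const visited = new Set<string>();
  const visiting = new Set<string>();

  function visit(table: string): void {
    if (visited.has(table)) return;
    if (visiting.has(table)) throw new Error(`FK cycle at ${table}`);
    visiting.add(table);
    for (const dep of deps[table] ?? []) visit(dep); // parents first
    visiting.delete(table);
    visited.add(table);
    order.push(table);
  }

  for (const table of Object.keys(deps)) visit(table);
  return order;
}
```

Processing tables in this order guarantees that when you copy or subset `orders`, the `users` rows they reference are already in place.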


What Pseudonymization Actually Means

GDPR distinguishes between anonymization and pseudonymization. They're not the same thing and the difference matters legally.

Anonymized data — data that cannot be linked back to an individual, even with additional information. GDPR doesn't apply to truly anonymized data. Aggregated statistics ("we have 10,000 users in Germany") are anonymized.

Pseudonymized data — data where direct identifiers are replaced with artificial identifiers (like UUIDs), but where re-identification is theoretically possible if you have the mapping. GDPR still applies to pseudonymized data — it's just considered lower risk.

When you replace a user's email with a randomly generated UUID in your analytics pipeline, that's pseudonymization. The UUID is meaningless on its own, but your mapping table could reconstruct the original email.

Practically, for test data purposes, you want anonymization — fake data generated fresh, with no mapping back to real individuals. The fake strategy above produces anonymized data. The mask strategy produces pseudonymized data (the structure is preserved, re-identification is theoretically possible for emails with few characters).
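The UUID-plus-mapping approach described above fits in a few lines. A sketch, with an in-memory `Map` standing in for a real mapping table:

```typescript
import { randomUUID } from 'crypto';

// The mapping is exactly what makes this pseudonymization rather than
// anonymization: whoever holds it can re-identify users.
const pseudonymMap = new Map<string, string>();

// Deterministic per email: the same user always gets the same token,
// so analytics can still count distinct users.
function pseudonymize(email: string): string {
  let token = pseudonymMap.get(email);
  if (!token) {
    token = randomUUID();
    pseudonymMap.set(email, token);
  }
  return token;
}
```

Because GDPR still applies to the pseudonymized data, the mapping itself needs the same access controls as the original PII.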


Encryption: What to Actually Encrypt

"Encrypt sensitive data" is common advice. Less common is guidance on what that actually means in practice.

Encrypt at rest: enable full-disk encryption on your database server. If you're on AWS RDS, GCP Cloud SQL, or any major managed database — turn this on. It's usually one checkbox and it covers your entire database.

Encrypt in transit: TLS for all connections. Your application connecting to your database should use TLS. Your users connecting to your API should use HTTPS. No exceptions.

Column-level encryption for specific high-risk fields (credit card numbers, health data, government IDs). Use a proper encryption library, not a homebrew solution. Store the encryption key separately from the data — ideally in a key management service (AWS KMS, HashiCorp Vault).

import { createCipheriv, createDecipheriv, randomBytes } from 'crypto';

// Don't do this in production — use a proper KMS
const ENCRYPTION_KEY = Buffer.from(process.env.FIELD_ENCRYPTION_KEY!, 'hex'); // 32 bytes, hex-encoded
const ALGORITHM = 'aes-256-gcm';

function encrypt(plaintext: string): string {
  const iv = randomBytes(16);
  const cipher = createCipheriv(ALGORITHM, ENCRYPTION_KEY, iv);

  const encrypted = Buffer.concat([
    cipher.update(plaintext, 'utf8'),
    cipher.final(),
  ]);

  const authTag = cipher.getAuthTag();

  // Store iv + authTag + encrypted together
  return Buffer.concat([iv, authTag, encrypted]).toString('base64');
}

function decrypt(ciphertext: string): string {
  const buffer = Buffer.from(ciphertext, 'base64');

  const iv = buffer.subarray(0, 16);
  const authTag = buffer.subarray(16, 32);
  const encrypted = buffer.subarray(32);

  const decipher = createDecipheriv(ALGORITHM, ENCRYPTION_KEY, iv);
  decipher.setAuthTag(authTag);

  return Buffer.concat([decipher.update(encrypted), decipher.final()]).toString('utf8');
}

What not to bother with: encrypting non-sensitive fields. Encrypting your entire users table including created_at and country adds overhead with zero compliance benefit. Be surgical — encrypt what's actually sensitive.


The GDPR Audit Checklist for Your Next Code Review

Before shipping any feature that touches personal data, go through this list:

  • [ ] Does this feature collect new personal data? If yes, is the purpose documented?
  • [ ] Is there a legal basis for processing this data?
  • [ ] Is this data ever logged? If yes, is it stripped of PII?
  • [ ] How does a user request deletion of this data?
  • [ ] Does deletion cascade to third-party services?
  • [ ] Is there a retention period defined for this data?
  • [ ] If this data goes to staging — is the pipeline anonymized?
  • [ ] Does accessing this data generate an audit log entry?
  • [ ] Is PII encrypted at rest and in transit?
  • [ ] Is there a data minimization review — do you actually need all the fields you're collecting?

The Bottom Line

GDPR compliance is not a single feature you build. It's a set of constraints that runs through your entire architecture — your schema design, your logging, your test data pipeline, your deletion logic, your third-party integrations.

The developers who handle it well treat it like any other non-functional requirement: they make decisions about it upfront, build the right abstractions early, and enforce it systematically rather than scrambling when a user submits an erasure request.

The developers who handle it poorly add a cookie banner, copy production to staging, and hope nobody notices.

In Europe right now, regulators are increasingly focused on developer-level violations — not just consent banners, but actual data handling practices. GDPR fines hit €1.2 billion in 2024. The technical implementation is no longer someone else's problem.

Build it right from the start. It's genuinely easier than fixing it later.
