VTT Extractor

Extracts the subtitle text from VTT files.

Input Assumptions:
  • Formatted in VTT
  • Empty lines separate each subtitle block
  • Sequence lines are optional
  • Sequence lines ending with -0 indicate the start of a paragraph

Regular expressions used in the process, applied in order:

  Number  Regular Expression                           Replace
  0       /^(.*-0)[^\S\r\n]*$/gm                       "\n$1"
  1       /^[^\S\r\n]+$/gm                             ""
  2       /\n(?:[^\r\n]*\n)?[^\r\n]*-->[^\r\n]*\n/g    ""
  3       /^([^\r\n]+)\n(?!\n)/gm                      "$1 "
  4       /^\s*|\s*$/g                                 ""

Explanation (items 1-5 describe expressions 0-4, in order):
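  1. Insert a newline before each paragraph start, indicated by a sequence line ending with -0 (trailing whitespace on that line is dropped).
  2. Empty out lines that contain only whitespace, so they count as empty lines.
  3. Remove VTT cue metadata: each timing line (the one containing -->) and the line before it (the sequence line, when present).
  4. Replace newlines inside each paragraph with a space.
  5. Remove leading and trailing whitespace from the whole text.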
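
The steps above can be sketched as one chain of replacements in JavaScript. This is a minimal illustration, not necessarily how the tool itself is written: the function name extractVttText and the sample input are made up for the example, and the sample assumes LF line endings and contains only cue blocks.

```javascript
// Hypothetical helper (the name is not from the source): applies the five
// replacements from the table in order, 0 through 4.
function extractVttText(vtt) {
  return vtt
    // 0: insert a newline before sequence lines ending with -0
    .replace(/^(.*-0)[^\S\r\n]*$/gm, "\n$1")
    // 1: empty out lines that contain only whitespace
    .replace(/^[^\S\r\n]+$/gm, "")
    // 2: remove each timing line and the optional line before it
    .replace(/\n(?:[^\r\n]*\n)?[^\r\n]*-->[^\r\n]*\n/g, "")
    // 3: join the lines of a paragraph with a space
    .replace(/^([^\r\n]+)\n(?!\n)/gm, "$1 ")
    // 4: trim leading and trailing whitespace of the whole text
    .replace(/^\s*|\s*$/g, "");
}

// Made-up sample: three cues, where the first and third start a paragraph.
const sample = [
  "1-0",
  "00:00:00.000 --> 00:00:02.000",
  "Hello there,",
  "",
  "2-1",
  "00:00:02.000 --> 00:00:04.000",
  "how are you?",
  "",
  "3-0",
  "00:00:04.000 --> 00:00:06.000",
  "New paragraph.",
].join("\n");

console.log(extractVttText(sample));
```

For this sample the result is two paragraphs separated by an empty line: "Hello there, how are you?" followed by "New paragraph.". The first two cues are merged because only sequence lines ending with -0 start a new paragraph.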

Copyright © 2024 dustbringer