This was not the pdfs.nycourts.gov we were looking for

https://github.com/pbutland/caughtlistening

caughtlistening

This repository contains transcript data from the Trump New York trial, indictment #71543/2023 (https://pdfs.nycourts.gov/PeopleVs.DTrump-71543/transcripts/).

The first transcript provided by the New York State Unified Court System was originally in PDF format. However, this was almost immediately taken down and replaced by what can only be described as an absurd and almost unusable alternative. I'm sure that they have their reasons. Being American probably chief among them.

Each page of a transcript was published as a separate HTML page, and within that page the text was displayed in an embedded image. As an image!!!

This is, of course, extremely unhelpful in many ways. For example:

  • it makes it extremely hard to view the material offline
  • it makes it impossible to search the transcripts
  • it makes it hard to go to a certain section of a transcript

To add insult to injury, they are using Cloudflare to verify that users are "human", which means that users are periodically asked to verify that they do actually exist.

This repository is here simply to make the information in the transcripts easier to use by converting the hosted images into text using OCR software.

Note: Since the creation of this repository, the transcripts as described above have now been replaced by the original PDF format. This repository now contains text generated from the PDF files rather than the original HTML files.

Directory structure

All transcript data can be found in the transcripts/data directory and, under this directory, there is a directory for each day of the trial.

Within the directory for each day of the trial there are the following subdirectories:

directory  description
html       the content of the original HTML pages as hosted by the NY court system
text       a text file for each page of the transcript PDF file
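
As a rough illustration of how the per-page files in a text subdirectory could be stitched back together, here is a minimal Python sketch. The page file names are not documented here, so the sketch assumes each name ends in a page number (a hypothetical convention) and sorts on it; `page_number` and `combine_pages` are illustrative helpers, not part of this repository.

```python
import re
from pathlib import Path


def page_number(path: Path) -> int:
    """Return the last run of digits in the file stem, or -1 if there is none."""
    digits = re.findall(r"\d+", path.stem)
    return int(digits[-1]) if digits else -1


def combine_pages(text_dir: Path) -> str:
    """Join every *.txt file in text_dir into one string, in numeric page order.

    Assumption: page files carry their page number in the file name.
    """
    pages = sorted(text_dir.glob("*.txt"), key=page_number)
    return "\n".join(p.read_text(encoding="utf-8").rstrip("\n") for p in pages)
```

Sorting on the parsed number rather than the raw name keeps page 10 after page 2, which a plain alphabetical sort would not.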

That describes the "raw" data. However, at the top level of the transcripts directory, there are also the following files:

filename                   description
<YYMMDD>.txt               formatted text combining all of the individual generated text files for that day of the trial
<YYMMDD>.pdf               PDF file which is an exact copy of the original transcript
<YYMMDD>.json              JSON file containing pertinent information for each line in a transcript
<YYMMDD>-openvoice-v1.mp3  MP3 audio file containing a TTS rendering of a transcript using OpenVoice V1
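
Since searching was one of the things the image-based pages made impossible, a small sketch of searching the combined text files may be useful. It assumes a local checkout with the `<YYMMDD>.txt` files sitting directly in the directory passed in; `search_transcripts` is a hypothetical helper, not something this repository provides.

```python
from pathlib import Path
from typing import Iterator, Tuple


def search_transcripts(root: Path, phrase: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (file name, 1-based line number, line) for each case-insensitive match."""
    needle = phrase.lower()
    for path in sorted(root.glob("*.txt")):
        # errors="replace" keeps the search going if OCR output has odd bytes.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line.lower():
                    yield path.name, number, line.rstrip("\n")
```

For example, `for name, number, line in search_transcripts(Path("transcripts"), "sustained"): print(name, number, line)` would list every matching line along with the file it came from.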

Generated text

The text was generated using the Tesseract OCR engine (https://github.com/tesseract-ocr/tesseract).

Tesseract has two main engines:

  • an LSTM engine (v4), which uses a neural net to convert the images into text.
  • a legacy engine (v3), which uses good ol' fashioned pattern recognition.

Because the transcript is typed, the legacy engine appears to be slightly better, so that is the engine used to generate the text within this repository.
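
For reference, the engine is chosen on the Tesseract command line with the `--oem` option (0 selects the legacy engine, 1 the LSTM engine). The sketch below shows one way to drive that from Python. It is not taken from caughtlistening-tools, and the exact options used for this repository are not stated here; the function names and the `eng` language default are assumptions for illustration.

```python
import subprocess
from typing import List

LEGACY_ENGINE = 0  # --oem 0: legacy engine only
LSTM_ENGINE = 1    # --oem 1: neural-net LSTM engine only


def tesseract_command(image_path: str, oem: int = LEGACY_ENGINE, lang: str = "eng") -> List[str]:
    """Build a tesseract invocation that writes the recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "--oem", str(oem), "-l", lang]


def ocr_page(image_path: str, oem: int = LEGACY_ENGINE) -> str:
    """Run tesseract on one page image and return the recognized text.

    Requires the tesseract binary on PATH and, for the legacy engine,
    traineddata files that include the legacy model.
    """
    result = subprocess.run(
        tesseract_command(image_path, oem),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Running the same page through both `LEGACY_ENGINE` and `LSTM_ENGINE` and comparing the output is a simple way to repeat the comparison described above. Note that Tesseract reads images, so PDF pages would need to be rendered to images first.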

Code

The source code used to generate the data files within this repository can be found at caughtlistening-tools (https://github.com/pbutland/caughtlistening-tools).

What next?

Great question!

The original intent of this project was to see if it was possible to generate an audio version of the transcripts using Text-To-Speech (TTS).

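A proof of concept (POC) was done based on the data within this repository, using the ElevenLabs (https://elevenlabs.io/) voice API to synthesize voices, allocating a different voice for each character. A sample of the generated audio can be found at https://github.com/pbutland/caughtlistening/blob/main/transcript-sample-elevenlabs.mp3.
An additional POC was done using OpenVoice (https://github.com/myshell-ai/OpenVoice). Samples of the generated audio for v1 and v2 can be found at https://github.com/pbutland/caughtlistening/blob/main/transcript-sample-openvoice-v1.mp3 and https://github.com/pbutland/caughtlistening/blob/main/transcript-sample-openvoice-v2.mp3 respectively. Both POCs can be found at caughtlistening-tools (https://github.com/pbutland/caughtlistening-tools).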
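
To give a feel for what allocating a different voice per character might involve, here is a minimal sketch. It is not the code from caughtlistening-tools and makes no TTS calls: it only maps speaker labels to voice names. The label pattern (an upper-case name followed by a colon at the start of a line) and the voice names are assumptions made for illustration.

```python
import re
from itertools import cycle
from typing import Dict, Iterable, List

# Assumed label shape: an upper-case name followed by a colon, e.g. "THE COURT:".
# Single-letter labels are deliberately not matched by this pattern.
SPEAKER = re.compile(r"^\s*([A-Z][A-Z.' ]+):")


def allocate_voices(lines: Iterable[str], voices: List[str]) -> Dict[str, str]:
    """Assign each distinct speaker a voice, reusing voices round-robin if needed."""
    pool = cycle(voices)
    assigned: Dict[str, str] = {}
    for line in lines:
        match = SPEAKER.match(line)
        if match:
            speaker = match.group(1).strip()
            if speaker not in assigned:
                assigned[speaker] = next(pool)
    return assigned
```

Each speaker keeps the same voice for the whole transcript, and when there are more speakers than voices the list simply wraps around.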

Disclaimers

  • No guarantee is provided that any files within this repository are accurate representations of the original transcripts. The original transcripts hosted at https://pdfs.nycourts.gov/PeopleVs.DTrump-71543/transcripts/ are the source of truth and any reference to the transcripts should cite these instead of anything in this repository.
  • No guarantee is given that the directory structure or file formats will remain the same. The location, name, and content of any file within this repository may change at any time without notice. The files within this repository should therefore not be treated as an API for any system outside of this repository. Every attempt will be made to keep the JSON files backward compatible where possible, but the structure is under active development.

Additional information

No AI entities (sentient or otherwise) were harmed in the production of this data.

{
"by": "dangermavin",
"descendants": 0,
"id": 40244816,
"score": 4,
"time": 1714718194,
"title": "This was not the pdfs.nycourts.gov we were looking for",
"type": "story",
"url": "https://github.com/pbutland/caughtlistening"
}
{
"author": "pbutland",
"date": null,
"description": "Contribute to pbutland/caughtlistening development by creating an account on GitHub.",
"image": "https://opengraph.githubassets.com/340dce6dca30ab5afb09cec89b22aaddbbaed7ded139e79745a9eb19f5f12f01/pbutland/caughtlistening",
"logo": "https://logo.clearbit.com/github.com",
"publisher": "GitHub",
"title": "GitHub - pbutland/caughtlistening",
"url": "https://github.com/pbutland/caughtlistening"
}