Self Host Videos

By Gabriel, 30 Jun 2024, updated 30 Jun 2024

This post is about self-hosting videos on a website. The most common way to publish videos on a website is to use a video hosting platform like YouTube, Vimeo, or Dailymotion. However, there are some reasons why you might want to host videos on your own server. In this post I will review the pros and cons of self-hosting videos and provide a walkthrough on how to do it. For now this is an experiment to see how it goes, and I haven't made up my mind yet on the best approach.

Introduction

With video self-hosting, the user uploads the video file to their own server (which can be a VPS, a dedicated server, or a shared hosting account) and then embeds the video player on their website. In this approach the user is responsible for the video encoding, storage, and delivery.

The other approach is to use a video hosting platform like YouTube, Vimeo, or Dailymotion. In this case, the user uploads the video to the platform. The platform takes care of the video encoding, storage, and delivery. Then the user can embed the video player on their website or simply share the link to the video on the platform.

There is a lot of reading on the internet about the pros and cons of each approach. For video hosting platforms, the important benefits to me are: a good CDN, availability of the video in a large number of resolutions and encodings and, for YouTube especially, a (potentially) bigger exposure because the recommendation algorithm can push your video to new users. For video self-hosting, what seems important to me is full control of the player and the recommendations. Then of course there is the financial aspect: video hosting platforms are either paid, or free but with ads served to your users.

I do feel that over the years the technical advantage of video platforms has shrunk with the development of web standards for video, and that some of the problems attributed to self-hosting videos are now just myths:

increased bandwidth consumption: yes, streaming video will use more data than the text of an article, but I imagine the same applies to a video hosting platform, which will have to recover that cost somehow (either by making the customer pay for data or by displaying more ads)
no variety of file formats: the 3 most used browsers in 2024 are Safari, Chrome and Firefox, and all of them support the most recent HTML5 web standards, so is that point still valid?
Embedding and maintaining your own video player: see above. The HTML5 video element is now supported by all modern browsers and is simple to use.
Need for videos of different sizes and encodings: true, big platforms have good tools to automatically make the uploaded video available in different resolutions and for different devices (through multiple encodings), and doing the same yourself requires more one-off processing power and more storage space. But I would argue that for most usages (blog, information website, small website), where video is just there to illustrate the content, this is not a strict requirement. For a self-hosted video, a WebM container using the VP9 codec is already supported by 97.4% of web users. Video platforms usually vigorously promote the benefits of serving different sizes and encodings of the same video, but is that really important for your video? Is it worth the cost for you, if it is a paid service, or the penalty of ads for your users, if it is a free service?

Of course the biggest platform of them all, YouTube, is likely to give your content better discovery exposure, ultimately more views, and potentially more revenue if that is the goal. But this post is about the technical aspect: is it reasonably easy to self-host video, and will it provide a good enough UX to your audience? I want to think it is, and I believe there are some types of video usage that are better served by this approach.

Processing the video for the Web

How do you go from several raw clips on your phone to a video that is ready to be uploaded to your website, with good enough compression and the right format? Another way to put it: how can you replace all the convenient tools of a video hosting platform with off-the-shelf tools?

In the description below I have chosen to make 2 important simplifications:

serve a single-resolution video: it won’t be possible for the user to choose a higher or lower resolution of the video. This is a common feature of video hosting platforms, but serving only one resolution is a good way to reduce the complexity of the process and save storage space.
serve a single format: I have chosen the WebM container using the VP9 codec for video and the Opus codec for audio. The H.264 video codec and the AAC audio codec would guarantee better compatibility with very old browsers, but they are patent-encumbered formats rather than open, royalty-free ones. A WebM container using the VP9 codec is already supported by 97.4% of web users, and growing. Serving a single format is, again, a good way to reduce the complexity of the process and save storage space.

1. Using iMovie

I have used iMovie (on Mac) to put together the right content from multiple phone-recorded clips: select, cut, combine, add transitions, and finally export to a file. This was the first time I used the tool, but for a non-specialist like me it was quite easy to use and I was able to produce a video that I was happy with.

I then exported the video at 540p (final resolution 304x540 for the example video).

Miscellaneous: I recorded all the clips in portrait mode on my phone (720x1280, i.e. 9:16) and iMovie, AFAIK, doesn’t let the user choose the aspect ratio of the produced video when creating a “Movie” (it defaults to 16:9). In my case, creating a Movie would produce a video with 2 big black bars on either side, to fit 16:9 without cropping any part of the input clips. The trick I found (on Reddit) was to create an “App preview” (instead of a “Movie”) in iMovie. This produces a 9:16 video.
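The resolution figures above can be checked with a bit of arithmetic. This is a minimal sketch, not something iMovie exposes: the function names are made up for illustration, rounding the width to an even number is an assumption that happens to match the 304x540 export, and the 1920x1080 frame is only an assumed size for a 16:9 "Movie" project.

```python
def scaled_width(src_width: int, src_height: int, target_height: int) -> int:
    """Width that keeps the source aspect ratio at target_height.

    Assumption: the result is rounded to the nearest even number,
    which many encoders prefer.
    """
    exact = src_width * target_height / src_height
    return 2 * round(exact / 2)


def side_bar_width(src_width: int, src_height: int,
                   frame_width: int, frame_height: int) -> float:
    """Width of each black bar when a clip is fitted by height into a wider frame."""
    fitted_width = src_width * frame_height / src_height
    return max(0.0, (frame_width - fitted_width) / 2)


# Portrait phone clip (720x1280) exported at a height of 540 pixels.
print(scaled_width(720, 1280, 540))            # 304

# Same clip placed in an assumed 1920x1080 (16:9) frame without cropping.
print(side_bar_width(720, 1280, 1920, 1080))   # 656.25
```

With these assumed numbers, each black bar would be wider than the clip itself, which is why the 9:16 project type matters for portrait footage.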

2. Compression and encoding with ffmpeg

The videos produced by my phone have a high bitrate (3.7Mb/s), i.e. they are big files, not optimized for web delivery. The video produced by iMovie is not much different (same codecs), except that the change of resolution reduces the file size a bit (bitrate 2.3Mb/s).

The goal is to serve a single-resolution video with the smallest bitrate possible while keeping a good enough quality when viewed on a mobile device. Some compression defects are OK but they shouldn’t distract the viewer from the content. After comparing with the bitrates of a few videos found elsewhere (see table below), I have settled on a bitrate of around 1Mb/s.

A video producer can choose the file format and the codecs used for the video and the audio. I wanted to use modern and open source codecs that are supported by most browsers. I’m not aiming for 100% browser support because the last ~1% of users forces video producers to use several codecs and formats for the same video. I have chosen the VP9 codec for video and the Opus codec for audio. The WebM container is used for the final video.

This is where the great tool ffmpeg comes in. Initially developed by Fabrice Bellard, it is now maintained by the FFmpeg team. It “is a free and open-source software project consisting of a suite of libraries and programs for handling video, audio, and other multimedia files and streams. At its core is the command-line ffmpeg tool itself, designed for processing video and audio files.” (from Wikipedia). From my understanding, it is the Swiss Army knife of video processing, and it is very popular.

With ffmpeg, I have encoded the video with the VP9 codec and the audio with the Opus codec, in a WebM container. Two-pass encoding is recommended for VP9, but I have used a single pass because it was much quicker and more appropriate for this small experiment (and the time I had to do it!).

On my machine (MacBook Pro 2019 - Intel i9 8 cores - 16 GB), the encoding speed was about 0.5x real time (e.g. a 7min24sec video took 15min to encode)
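For reference, here is what the two-pass variant could look like. This is only a sketch of the commonly documented two-pass, constant-quality invocation for libvpx-vp9, not what was run for this post. The helper name `vp9_two_pass_commands` is made up for illustration, and it only builds and prints the two command lines without calling ffmpeg.

```python
import shlex


def vp9_two_pass_commands(src: str, dst: str, crf: int = 35) -> list[list[str]]:
    """Build the two ffmpeg invocations for a two-pass VP9/Opus WebM encode."""
    common = ["ffmpeg", "-i", src, "-c:v", "libvpx-vp9", "-b:v", "0", "-crf", str(crf)]
    # Pass 1 only gathers statistics: audio is dropped and the output is discarded.
    first = common + ["-pass", "1", "-an", "-f", "null", "/dev/null"]
    # Pass 2 reuses those statistics and writes the real file, with Opus audio.
    second = common + ["-pass", "2", "-c:a", "libopus", dst]
    return [first, second]


for command in vp9_two_pass_commands("20240530--deck.mp4", "20240530--deck.webm"):
    print(shlex.join(command))
```

The two printed commands have to be run one after the other from the same directory, since the second pass reads the statistics file written by the first. Expect the total encoding time to be longer than the single-pass figure.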

3. Generating video thumbnail

I have used ffmpeg to extract a frame from the video to use as a thumbnail.

4. Generate captions with OpenAI’s Whisper

I often find myself watching videos with captions on. It is not that I have a hearing problem, but I find it easier to follow the content. I would consider (without any user research to back that up :) ) that it is an important feature to have when serving videos. Transcription is a time-consuming task, but I think that now, thanks to machine learning, automatic speech recognition is a solved problem in computing. Solved like chess playing is solved: it has reached a point where the machine can deliver human-level performance. One can achieve it using open-source tools running on a modest personal computer. I have used OpenAI’s Whisper, which is to my knowledge the first tool of that kind available.

More precisely, I have used the great C++ port of OpenAI’s Whisper by G. Gerganov: whisper.cpp. It runs on a CPU-only machine and is available as a Docker image. It is a command-line tool that takes an audio file as input and can output a transcript in the VTT format. It comes with models of different sizes: the larger the model, the better the quality of the transcript but the longer the processing time. I tested the base model initially, but found that medium.en was a better fit for my use case.

See the caption extracts below for the video I have posted here:

using base model

[00:00:00.000 --> 00:00:11.400]   Hi, so today the walk will be to finish the concrete here so remove all the water
[00:00:11.400 --> 00:00:19.360]   accumulated over a weak here heavy rain and concrete it and then do a bit
[00:00:19.360 --> 00:00:24.840]   the same thing on the other bearer on the deck so this bearer was put a few
[00:00:24.840 --> 00:00:32.940]   months ago now the change it's to it's not kind of that way so I'm gonna unscrew
(...)

using medium.en model

(and using option to limit the segment length to 32 characters for better readability on the screen)

[00:00:00.000 --> 00:00:07.490]   Hi, so today the work will be
[00:00:07.490 --> 00:00:10.710]   to finish the concrete here, so
[00:00:10.710 --> 00:00:11.820]   remove all the water that
[00:00:11.820 --> 00:00:15.010]   accumulated over a week here,
[00:00:15.010 --> 00:00:18.070]   heavy rain, and do a bit of the
[00:00:18.070 --> 00:00:20.520]   same thing on the other
[00:00:20.520 --> 00:00:23.950]   bearer on the deck. So this be
[00:00:23.950 --> 00:00:27.270]  arer was put a few months ago,
[00:00:27.270 --> 00:00:29.920]   now the change it's not
[00:00:29.920 --> 00:00:32.080]   far enough that way. So I'm
[00:00:32.080 --> 00:00:34.270]   going to unscrew the four
[00:00:34.270 --> 00:00:36.700]   attachments to the four posts,
(...)

-> fewer mistakes and better punctuation.
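The extracts above are in the bracketed format printed on the console, while a caption file in the VTT format stores the same cues slightly differently: a `WEBVTT` header, then one block per cue with the timing line followed by the text. whisper.cpp writes that file itself, so the converter below is only an illustrative sketch of the relationship between the two formats. The function name `console_to_webvtt` is hypothetical.

```python
import re

# Matches lines such as "[00:00:00.000 --> 00:00:07.490]   Hi, so today ..."
SEGMENT_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)$"
)


def console_to_webvtt(lines):
    """Turn bracketed console segments into the text of a WebVTT file."""
    cues = []
    for line in lines:
        match = SEGMENT_RE.match(line.strip())
        if not match:
            continue  # skip "(...)" and anything that is not a timed segment
        start, end, text = match.groups()
        cues.append(f"{start} --> {end}\n{text.strip()}")
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"


sample = [
    "[00:00:00.000 --> 00:00:07.490]   Hi, so today the work will be",
    "[00:00:07.490 --> 00:00:10.710]   to finish the concrete here, so",
    "(...)",
]
print(console_to_webvtt(sample))
```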

On my machine (MacBook Pro 2019 - Intel i9 8 cores - 16 GB), the processing speed was:

for base: about 8x faster than real time (e.g. 312 sec audio -> 41 sec processing)
for medium.en: about 1x the duration of the audio file (e.g. 312 sec audio -> 328 sec processing)
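The speed factors quoted in this post are simply the media duration divided by the processing time. A small sketch, using only the timings reported here (the function name is made up for illustration):

```python
def speed_factor(media_seconds: float, processing_seconds: float) -> float:
    """How many seconds of media are processed per second of wall-clock time."""
    return media_seconds / processing_seconds


# Whisper base model: 312 s of audio in 41 s.
print(round(speed_factor(312, 41), 1))                # 7.6

# Whisper medium.en model: 312 s of audio in 328 s.
print(round(speed_factor(312, 328), 2))               # 0.95

# VP9 encode: a 7 min 24 s video in 15 min.
print(round(speed_factor(7 * 60 + 24, 15 * 60), 2))   # 0.49
```

A factor above 1 means faster than real time, so the medium.en model costs roughly 8 times more processing than base for the same audio on this machine.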

5. Summary of the commands

export FILENAME=20240530--deck
# Encode the video to a WebM file, using the VP9 codec for video and the Opus codec for audio
# (-crf 35 together with -b:v 0 selects constant-quality mode)
ffmpeg -i "${FILENAME}.mp4" -c:v libvpx-vp9 -crf 35 -b:v 0 -c:a libopus "${FILENAME}.webm"

# Extract the audio and convert it to 16 kHz mono 16-bit WAV, the input format whisper.cpp expects
mkdir -p audios
ffmpeg -i "${FILENAME}.mp4" -c:a copy -vn "audios/${FILENAME}.m4a"
ffmpeg -i "audios/${FILENAME}.m4a" -ar 16000 -ac 1 -c:a pcm_s16le "audios/${FILENAME}.wav"

# Generate captions (the model file is expected to already be in ./models)
docker run -it --rm -v "${PWD}/models:/models" -v "${PWD}/audios:/audios" -e FILENAME="${FILENAME}" ghcr.io/ggerganov/whisper.cpp:main "./main -m /models/ggml-medium.en.bin -f /audios/${FILENAME}.wav -ml 32 --output-vtt -of /audios/${FILENAME}"

# Extract a frame from the video to use as a thumbnail
ffmpeg -ss 00:00:13 -i "${FILENAME}.mp4" -frames:v 1 -q:v 2 "${FILENAME}.jpg"

6. Summary of the sizes, bitrates and formats

Example given for a 1min15sec video, processed through the steps described above, and published here.

Source	Filename	res.	format	video codec	audio codec	size	bitrate (/s)
Phone	VID_20240530_132534.mp4	720x1280	mp4	h264	aac	34.8MB	3727 kb
iMovie	20240530--deck.mp4	304x540	mp4	h264	aac	21.7MB	2321 kb
ffmpeg	20240530--deck.webm	304x540	webm	vp9	opus	5.7MB	641 kb
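The size, duration and bitrate columns are tied together by simple arithmetic, which is handy for estimating storage before encoding. This is a rough sketch with made-up function names, and it assumes 1 MB = 1,000,000 bytes. The result will not exactly match the bitrates in the table, probably because of rounding of the size and duration and the MB/MiB convention of the tool that reported them (I have not verified this).

```python
def average_bitrate_kbps(size_mb: float, duration_s: float) -> float:
    """Average bitrate in kilobits per second (assuming 1 MB = 1,000,000 bytes)."""
    return size_mb * 1_000_000 * 8 / duration_s / 1000


def size_mb_for_bitrate(bitrate_kbps: float, duration_s: float) -> float:
    """Approximate file size in MB for a given average bitrate and duration."""
    return bitrate_kbps * 1000 * duration_s / 8 / 1_000_000


# The 5.7MB WebM file of the 1min15sec example video.
print(round(average_bitrate_kbps(5.7, 75)))   # 608

# What a 1 Mb/s target would cost for the same duration.
print(size_mb_for_bitrate(1000, 75))          # 9.375
```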

For comparison here are stats about other videos on different platforms:

Source	Name	duration	res.	format	video codec	audio codec	size	bitrate (/s)
WhatsApp	a video received 1 (Jan-2024)	0m52s	848x480	mp4	h264	aac	10.0MB	1575 kb
WhatsApp	a video sent 1 (May-2024)	0m58s	848x478	mp4	h264	aac	12.0MB	1695 kb
YouTube	Aznavour la boheme (live)	5m20s	320x240	mp4	h264	aac	11.2MB	286 kb
YouTube	DIY DECK Part 6 - Railing	12m11	1290x720 (720p)	mp4	h264	aac	91.6MB	1.0M
YouTube	How to install Joist hanger	1m01	1290x720 (720p)	mp4	h264	aac	4.01MB	0.5M
YouTube	How to build deck part 4 - Boards	3m37	1290x720 (720p)	mp4	h264	aac	25.7MB	0.9M

Note: The sizes of the YouTube videos are those of the video files downloaded with NewPipe on my Android phone.

Embedding the video on the website using HTML5 video element

The video element is supported by all modern browsers. I have used very few of the attributes available and no custom styling at all for this experiment so far:

Example of the rendered markup for this video:

<video id="video" controls preload="metadata" poster="/assets/videos/posters/20240530--deck.jpg" style="aspect-ratio: 304 / 540">
    <source src="/assets/videos/20240530--deck.webm" type="video/webm" />
    <track
            label="English"
            kind="subtitles"
            srclang="en"
            src="/assets/videos/captions/20240530--deck.vtt"
            default />
    Download the
    <a href="/assets/videos/20240530--deck.webm">WEBM</a>
    video.
</video>

Reviewing the main attributes:

defining a poster/thumbnail for the video: the browser can quickly download this small image and display it while the video is loading. This is good practice: it avoids loading the video file unless the visitor clicks play, and it avoids a layout shift when the video is loaded.
preload="metadata": the browser will only load the metadata of the video, including the duration. Displaying the duration of the video before the user clicks play is a common practice and, I think, a user expectation. (Note: in practice the browser makes an HTTP range request to get just the beginning of the file, which contains the metadata.)
Source: the video file in the WebM format.
Track: the caption file in the VTT format.
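Since the markup follows the same pattern for every video, it could be generated from the base filename and the video dimensions. The sketch below is hypothetical and is not how this site is built. The function name `render_video` is made up, and it assumes the `/assets/videos` directory layout visible in the example above.

```python
from html import escape


def render_video(name: str, width: int, height: int, base: str = "/assets/videos") -> str:
    """Render the <video> markup for one self-hosted WebM file with poster and captions."""
    webm = escape(f"{base}/{name}.webm", quote=True)
    poster = escape(f"{base}/posters/{name}.jpg", quote=True)
    captions = escape(f"{base}/captions/{name}.vtt", quote=True)
    return (
        f'<video controls preload="metadata" poster="{poster}" '
        f'style="aspect-ratio: {width} / {height}">\n'
        f'    <source src="{webm}" type="video/webm" />\n'
        f'    <track label="English" kind="subtitles" srclang="en" '
        f'src="{captions}" default />\n'
        f'    Download the <a href="{webm}">WEBM</a> video.\n'
        f"</video>"
    )


print(render_video("20240530--deck", 304, 540))
```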

I have added structured data markup VideoObject to help search engines (mainly Google) understand the content of the video, which in turn should help indexing and discovery for the video. Just trying to get back a little bit of the discoverability lost by not using YouTube!

<script type="application/ld+json">{
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": "Deck Building Journal - Protecting the Deck",
        "description": "Adding 2 layers of water-based varnish to protect the deck.",
        "thumbnailUrl": [
            "https://www.info2007.net/assets/videos/posters/20240530--deck.jpg"
        ],
        "uploadDate": "2024-05-30T00:00:00+00:00",
        "duration": "PT1M14S",
        "contentUrl": "https://www.info2007.net/assets/videos/20240530--deck.webm"
    }</script>
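The `duration` field uses the ISO 8601 duration format (`PT1M14S` is 1 minute and 14 seconds). Below is a minimal sketch of producing that value and the JSON-LD body from a duration in seconds. Both function names are hypothetical, and the URL layout is simply copied from the example above.

```python
import json


def iso8601_duration(total_seconds: int) -> str:
    """Format a number of seconds as an ISO 8601 duration such as PT1M14S."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    text = "PT"
    if hours:
        text += f"{hours}H"
    if minutes:
        text += f"{minutes}M"
    if seconds or text == "PT":
        text += f"{seconds}S"
    return text


def video_object_json(name, description, site, slug, upload_date, duration_s):
    """Build the VideoObject JSON-LD body for one self-hosted video."""
    data = {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "description": description,
        "thumbnailUrl": [f"{site}/assets/videos/posters/{slug}.jpg"],
        "uploadDate": upload_date,
        "duration": iso8601_duration(duration_s),
        "contentUrl": f"{site}/assets/videos/{slug}.webm",
    }
    return json.dumps(data, indent=4)


print(iso8601_duration(74))  # PT1M14S
print(video_object_json(
    "Deck Building Journal - Protecting the Deck",
    "Adding 2 layers of water-based varnish to protect the deck.",
    "https://www.info2007.net",
    "20240530--deck",
    "2024-05-30T00:00:00+00:00",
    74,
))
```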

Conclusion

I have followed that approach to publish 11 videos related to a home project (without editing), see Videos. I am happy with the result. The video quality is good enough for the content. The auto-generated captions are not perfect but I’ll blame my accent and sometimes a poor audio capture with my phone outside! My server is located in France, there is no CDN, and I tested from my current location in Australia: the video loaded fast enough for it not to be a noticeable problem.

The research work to put in place the pipeline described above took a couple of ~~hours~~ days, but now adding a video to the website is quite straightforward.

This is just the initial experiment. Over the coming months, I will probably add more videos, which will create more scenarios. I will collect usage feedback from users and from myself (different browsers, networks, etc.). All of it will refine this experiment and help reach a broader conclusion as to what can and cannot be done with self-hosted videos.