Brood War Korean Translations

As work slowed down during the last couple of weeks of 2024, I decided to redirect some of my energy to hobbies instead of work. One such hobby is StarCraft: Brood War (or BW for short), a classic, highly competitive RTS from '98 that still has an active community today.

Over those couple of weeks, I was able to make great progress in solving a decades-old problem in the BW community, using a mix of LLMs and free (gratis*) software.

Cultural context

Understanding the cultural context BW exists in is crucial to understanding the rest of this post. High-level BW is – for all intents and purposes – a Korean game. The overwhelming majority of professional players, teams, and tournaments (as well as most of the passionate audience and community) are based in Korea, and have been for 20+ years. This is so ingrained in the BW culture that those of us in the community who are not Korean call ourselves "foreigners," a slightly self-deprecating term.

Similarly to chess, BW is a game of strategy with a long history of recorded games. As such, playing the game is only half the challenge; studying it is just as important (especially at higher levels of play). For decades, the BW community (both Korean and foreigner) has watched and analyzed professional matches from years past and derived valuable strategic insights from them.

Note: given the nature of BW as a video game, much of the modern discourse takes place in videos and live streams published by professional BW players.

Continuing to use chess as an analogy, you might know that chess openers are collections of well-studied first moves for chess. The interactions between white-piece openers and black-piece openers have been analyzed for decades (centuries?), and their study is considered fundamental for beginner and intermediate chess players.

The only aspect of these openers you need to focus on for this post is how they shape the language used by the chess community. Terms like "Sicilian defense", and its "Najdorf variation", "Dragon variation", or "Accelerated Dragon variation" are all community-developed names. None of these terms are written in the rules of chess. They are part of a domain-specific language the community itself has developed, condensing and communicating the aforementioned decades of study and analysis effectively.

This association of language and history, combined with the great divide between the Koreans and "foreigners," leads us to what is commonly called the Foreigner Knowledge problem.

Foreigner Knowledge problem

Very few members of the foreigner community are fluent in Korean. Foreigner access to Korean BW discourse is almost a contradiction in terms: if you speak Korean fluently, you have no reason to be in the foreigner community, as it only has access to material that is strictly inferior and more limited. For this reason, Korean-speaking members in the foreigner community are exceedingly rare.

In an effort to gain access to the Korean discourse, we have crowdfunded money to pay these (acquiescing) Korean-speaking foreigners to translate the subtitles of YouTube videos. Everyone involved knows the translation work is taxing and slow, and that the pay is objectively terrible. Nevertheless, this approach has gotten the community a few dozen translated videos a year.

As an alternative, we have relied on machine translation of text content instead. While machine translation tools like Google Translate are good at translating everyday sentences, they are ill-suited for domain-specific languages, which are full of jargon the tools have not been exposed to. To add insult to injury, translating such jargon literally is actually counter-productive.

To illustrate this, here's an excerpt of the subtitles from a Korean video tutorial, about the BW-equivalent of an opener (called a build order, or simply build):

안녕하세요 오늘 강의해 드릴 내용은 12 안마당 빌드입니다. 12 안마당의 종류와 장단점들 그리고 빌드 오더를 간단하지만 자세하게 알려드리려고 합니다. 토스전에서는 가장 부유하게 시작하고 싶을 때 사용하는 빌드고요. 테란전에서는 12 안마당으로 할 수 있는 빌드가 여러가지가 있습니다. 그래서 가장 많이 사용하는 빌드들을 몇가지 알려드리려고 합니다. 첫 번째로 투에처리 빌드인데 12 안마당으로 시작하는 빌드입니다. 12 안마당 11 스포닝풀 10가스 이제 빠른 가스를 활용한 빌드인데요. 이 빌드는 투에처리 빌드를 하실때 3에처리를 빠르게 3가스 멀티에 가져가면서 플레이를 할 때 많이 사용을 하고요. 두번째로 12압 12풀 12가스 적당히 빠른 테크트리와 적당히 빠른 3에처리 빌드입니다. 12압 12가스 적당히 빠른 테크트리와 적당히 빠른 3에처리 빌드입니다. 이 빌드 같은 경우는 흔히들 말하는 안 3에처리라고 많이들 얘기를 하는데 뮤탈리스크도 빠르고 3에처리도 빠른 그런 빌드라고 생각하시면 되요.

And here is the Google Translate version:

Hello, today's lecture is about the 12 courtyard build. I will briefly but in detail explain the types, pros and cons of the 12 courtyard, and the build order. In the Toss match, it is the build used when you want to start the most affluently. In the Terran match, there are various builds that can be done with the 12 courtyard. So I will tell you about some of the most used builds. The first is the two-processing build, which starts with the 12 courtyard. 12 courtyard 11 spawning pool 10 gas Now, this is a build that utilizes fast gas. This build is often used when playing with the two-processing build while quickly bringing the 3-processing to the 3-gas multi. The second is the 12 pressure 12 pool 12 gas moderately fast tech tree and moderately fast 3-processing build. 12 pressure 12 gas moderately fast tech tree and moderately fast 3-processing build. In the case of this build, many people talk about it as a build that is fast at both Mutalisk and 3-point processing.

You might be confused if you are not a BW player. Consider yourself lucky, because I promise you you'd be flabbergasted if you did play the game.

Anyone familiar with BW will tell you that this translation's signal-to-noise ratio is well below 1/9000. There are no "courtyards" in BW. "Starting the most affluently" is an awkward and verbose way to say "this is an economic opener/build." What the hell is a "3-point processing"?!

Sure, there are a few other recognizable... utterances, such as "Toss" (Protoss), "build", "fast gas", "moderately fast tech-tree" and "Terran" – but the overwhelming majority of it has been translated literally and is nothing but noise, destroying any context that might have redeemed those few recognizable bits.

The combination of poor automatic subtitles and literally-translated jargon has caused the Foreigner community to lag behind the Korean community for decades.

The new translation process

This is what we are now able to achieve with my new machine translation process:

Hello, today's lecture will cover the 12 Natural build. I’ll explain the types, pros, and cons of the 12 Natural build and provide a simple yet detailed build order. This build is used when you want to start the most economically against Protoss. Against Terran, there are multiple builds you can use with the 12 Natural. I’ll go over some of the most popular ones. The first one is the two-Hatchery build. Starting with the 12 Natural allows for the most economically strong opening. The 12 Natural goes with an 11 Spawning Pool and a 10 Gas. This is a build utilizing fast gas. It's often used to play with a quick third Hatchery and three gas expansions when doing the two-Hatchery build. The second is the 12 Hatch, 12 Pool, 12 Gas. It's a balanced build with a moderate tech tree and a moderately fast third Hatchery. First, I’ll briefly talk about a quick defense build strategy. Next, we’ll discuss how you can manage resources optimally for this setup and handle transitions smoothly. Though there are risks, it allows for a stable and dynamic approach in longer games.

Even for non-players, the above translation should be much more reasonable than the previous Google Translate version. For BW players, it is immediately clear what the video is about. While this is by no means a perfect translation (there are improvements to be made, as will be noted in the footer of this post), the signal-to-noise ratio has dramatically improved.

Furthermore, in contrast to the previous pace of a few dozen videos a year, I ended up translating about 7 videos in a single day. As a single person. During my spare time.

The least I can say is: from now on, this is the worst this will ever be!

Tech stack

The process is divided into two parts: producing subtitles, and consuming subtitles. The "production" part is aimed at members of the community who are up to date with the Korean content, and know which videos should be prioritized. The "consumption" part is aimed at everyone else who simply wants to watch the translated videos.

Producing subtitles

yt-dlp + OpenAI Whisper

I initially used yt-dlp to download YouTube's automatic Korean subtitles to try and translate them. As shown above, though, they are useless. So instead of downloading the subtitles, I use yt-dlp to download the audio track of the videos.

With the audio tracks downloaded, I transcribed my own subtitles from them using Whisper. I am not sure why, but transcribing the Korean subtitles using AI (rather than whatever it is YouTube uses) provided much cleaner and more complete results. It also seems to do a better job of ignoring in-game sounds and other noise (such as mouse clicks).

When I first started developing this process, I was running Whisper locally, using ~10GB of VRAM. I quickly saw that this would be far too restrictive if I wanted to have other collaborators translating, since each of them would have to install Python on Windows, create a venv, install CUDA, install git... So, I decided to find an alternative.

Google Colab

Whisper is installed through Python's pip, and Google Colab is not just a free Jupyter notebook service – it also offers free GPUs, which happen to have just enough VRAM to run Whisper!

Creating a notebook and sharing a parameterized, read-only version of it was an ideal distribution model for this kind of work. With a little documentation, it allowed non-technical people to run it effortlessly, no matter their hardware. Furthermore, I could update the notebook with features (aka bugs), and people would receive those updates automatically, which made it even more accessible: no git cloning, no constant downloads.

The notebook I created receives one parameter – the YouTube URL – and generates+downloads an SRT file in Korean.

All of this was neat, but I still hadn't done the fundamental work of translating from Korean to English.

LLMs and the slang dictionary

To say that this entire project stands on the shoulders of giants would be an understatement.

TeamLiquid (TL) is the oldest, longest-running foreigner (or at least, American) BW community. As it turns out, sometime around 15 years ago, one TL member called konadora created a forum post with a dictionary of "BW slang used by commentators." I copied the post into a simple markdown file, KoreanSlang.txt, and provided it to the modern era's favorite piece of technology: an LLM!

Despite the multiple (and multiplying) misuses of LLMs, this problem was fundamentally a language problem, which is a perfect use of LLMs.

Using this prompt:

I will give you subtitles for you to translate from Korean to English, using the "KoreanSlang.txt" file for support. Always consider that the content you are translating exists within the context of Starcraft: Brood War. Feel free to adapt the translation, as clarity is more important than a literal translation. Always respect the timestamps and the file formatting, but feel free to correct duplicate subtitles or obvious mistakes. Short "translator notes" may be welcome if appropriate (embedded as subtitles). Do not add quotation marks if they were not present originally. Some of the subtitles may have some noise or errors; please replace them with "(unintelligible)" if you find them. Format the output as code (ie. surrounded by triple back-ticks) for easier copy-pasting.

And a Pro (Premium? Plus? who knows) ChatGPT account, it was a simple matter of providing the Korean subtitles generated by Whisper, and asking it to translate them.

At first, I simply took translated subtitles, downloaded the corresponding video, and applied them using my local video player – which meant that, once again, I had to figure out a way to make these subtitles accessible to others.

Note: No altruism or aspiration for good product design played a part in this. My ISP imposes a data cap, and charges me double if I ever exceed it – so downloading videos left and right was not an option! >:(

Consuming subtitles

A big consideration is that YouTube has its own subtitles feature, but only the video owner can set or update a video's subtitles. Wanting to avoid the whole download, re-upload, set subtitle work cycle, I decided to build a solution that was accessible and saved me time (and bandwidth).

TamperMonkey

I wrote a UserScript that adds a button to YouTube videos, and downloads the corresponding translated subtitles if they exist. It is able to parse SRT and VTT files, and shows them in a tiny little container below the YouTube player.

It looks like a button straight from Web 1.0, and I love it. I call it the BWKT Client.

And, well... the mention of a Client implies the existence of a Server.
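For the curious, the SRT/VTT handling described above can be sketched roughly as follows. This is a minimal illustration and not the actual BWKT Client code: the function names (parseTimestamp, parseCues, findCueAt) are hypothetical, and it assumes well-formed files where cues are separated by blank lines.

```javascript
// Hypothetical helpers: a sketch of SRT/VTT cue parsing, not the real client code.

// Convert "HH:MM:SS,mmm" (SRT) or "HH:MM:SS.mmm" / "MM:SS.mmm" (VTT) to seconds.
function parseTimestamp(stamp) {
  const [clock, millis = "0"] = stamp.trim().split(/[,.]/);
  // Fold the clock parts left to right so 2-part and 3-part clocks both work.
  const seconds = clock
    .split(":")
    .map(Number)
    .reduce((total, part) => total * 60 + part, 0);
  return seconds + Number(millis) / 1000;
}

// Parse subtitle text into an array of { start, end, text } cues.
function parseCues(text) {
  const cues = [];
  // Normalize line endings, then split into blocks separated by blank lines.
  const blocks = text.replace(/\r\n?/g, "\n").trim().split(/\n\s*\n/);
  for (const block of blocks) {
    const lines = block.split("\n");
    // The timing line contains "-->"; anything before it (an SRT counter,
    // a VTT cue identifier) is ignored.
    const timingIndex = lines.findIndex((line) => line.includes("-->"));
    if (timingIndex === -1) continue; // e.g. the "WEBVTT" header block
    const [start, end] = lines[timingIndex].split("-->");
    cues.push({
      start: parseTimestamp(start),
      // VTT allows cue settings after the end time; keep only the timestamp.
      end: parseTimestamp(end.trim().split(/\s+/)[0]),
      text: lines.slice(timingIndex + 1).join("\n"),
    });
  }
  return cues;
}

// Return the cue that should be on screen at `time` (in seconds), or null.
function findCueAt(cues, time) {
  return cues.find((cue) => time >= cue.start && time < cue.end) || null;
}
```

In a UserScript, something like findCueAt(cues, video.currentTime) could be called from the video element's timeupdate event to refresh the subtitle container; that wiring is also an assumption on my part.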

Pastebin

Subtitles are fundamentally text files, and I needed a quick and easy way to share text files. Pastebin immediately came to mind as a potential candidate, if only for the proof-of-concept. It is simple to use, I have a premium account for it, and I did not (and still do not) expect traffic to be a concern in any way.

Note: there is nothing more permanent than a temporary solution.

Google Sheets + Apps Script

All I need is two columns: YouTube URL, and Subtitle URL (and maybe a third column, Language, but that's for later). Using Google Sheets as a database is ideal for this use case, as it is extremely easy to manually inspect and update, and it is also trivial to share with others.

The cherry on top is Google's Apps Script, which is basically a JavaScript runtime that has first-party access to Google Workspace, which includes Google Sheets; and it has a built-in web server to boot!

It was literally a matter of writing function doGet(req) { /* ... */ }, which receives a YouTube video ID (the eleven-character identifier after v= in a URL like https://www.youtube.com/watch?v=XXXXXXXXXXX), and returns the corresponding Pastebin URL.
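To give an idea of how little code that is, here is a rough sketch of such a handler. It is not the script I actually deployed. It assumes the script is bound to the spreadsheet, that the first sheet has a header row followed by the two columns described above, and that the video ID arrives in a query parameter named v. The helper names (extractVideoId, findSubtitleUrl) are made up for this illustration.

```javascript
// Pull the 11-character video ID out of a YouTube URL, or return null.
function extractVideoId(url) {
  const match = /(?:[?&]v=|youtu\.be\/)([A-Za-z0-9_-]{11})/.exec(url);
  return match ? match[1] : null;
}

// rows: [[youtubeUrl, subtitleUrl], ...] as read from the two-column sheet.
// Returns the subtitle URL for the given video ID, or null if there is none.
function findSubtitleUrl(rows, videoId) {
  const row = rows.find(
    ([youtubeUrl]) => extractVideoId(String(youtubeUrl)) === videoId
  );
  return row ? row[1] : null;
}

// Apps Script entry point for HTTP GET requests to the deployed web app.
// SpreadsheetApp and ContentService are Apps Script globals, so this function
// only runs inside Apps Script; the helpers above are plain JavaScript.
function doGet(req) {
  const rows = SpreadsheetApp.getActiveSpreadsheet()
    .getSheets()[0]
    .getDataRange()
    .getValues()
    .slice(1); // skip the header row (assumed to exist)
  // Assumption: the client sends the ID as ?v=XXXXXXXXXXX.
  const subtitleUrl = findSubtitleUrl(rows, req.parameter.v);
  // An empty body means "no subtitles for this video".
  return ContentService.createTextOutput(subtitleUrl || "");
}
```

Keeping the lookup in a plain function separate from doGet makes it possible to test it outside of Apps Script.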

Improvements

There are certainly improvements to be made. I still see this project as a hacked-together contraption built over the course of two weeks, with a budget of ~$30 USD (I paid for ChatGPT and more GPU time in Colab, neither of which is strictly required, but both were nice to have).

One such improvement is supporting multiple languages in the UserScript. After all, the Foreigner community comprises not just American players, but also Mexican, Polish, Romanian, Chinese, Taiwanese – there are dozens of languages that these videos could be translated into, and it would take only a small amount of effort to do.

There are technical improvements to be made as well. For example: right now, the BWKT Client shows the button below every YouTube video, even if subtitles are not available for that particular video. I could (should?) add another endpoint to the Apps Script web server that returns an index of translated videos, and then the UserScript could show/hide the button appropriately.

Note: Speaking of which, here is the public-facing, read-only database!
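The show/hide idea above could look something like this on the client side. This is purely a sketch of a feature that does not exist yet: it assumes the hypothetical index endpoint returns a JSON array of video IDs, and the names buildIndex and hasSubtitles are invented for the example.

```javascript
// Turn the (hypothetical) index response, an array of video IDs,
// into a Set for constant-time membership checks.
function buildIndex(videoIds) {
  return new Set(videoIds);
}

// Decide whether the button should be shown for the page at `pageUrl`.
function hasSubtitles(index, pageUrl) {
  // On a watch page, the video ID is the value of the "v" query parameter.
  const videoId = new URL(pageUrl).searchParams.get("v");
  return videoId !== null && index.has(videoId);
}
```

The UserScript would fetch the index once, then call hasSubtitles(index, location.href) whenever YouTube navigates to a new video.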

Shipping it

After all was said and done, I reached out to a few fellow BW enjoyers to put the subtitles to the test. The intention was to judge the quality of the translation process and to see if the subtitles worked well on YouTube. We hopped on a voice chat, I streamed my screen, and we started watching a 13-minute-long video.

About 5 minutes in, I had to pause the video and take a step back. As it turns out, the group had been watching the video, and reading the subtitles, yes – but we also had been making comments among ourselves, replaying short clips from the video, pausing it to discuss... and in the middle of it all, we had forgotten what we were originally trying to do: evaluate the quality of the subtitles.

Needless to say, the subtitles passed the test. :)

Final thoughts

One sneaky aspect of this project that worked well in my favor is that performance is not critical, nor is scale, or latency, or anything else that most software projects often deal with. Most of what I did was glue already-existing solutions together.

The custom business logic (the UserScript, and the Python code in the Colab notebook) is short, and effortlessly maintainable. The web server is the simplest production CRUD system ever, and I see no reason for it to ever grow in complexity in any significant way.

It's the jankiest project I have shipped, and I love it.