RS Markov Chain Generator

This project was developed in 2010, mainly for fun. It is now deprecated. This page only exists for archival purposes.

Project summary

This application tries to generate somewhat correct text by using an algorithm called Markov chain. Simply put, Markov chain is an algorithm that makes its next state dependent on the previous one. Even more simpler, it analyzes correct text written by humans, makes a link between the used words, then tries to generate new text, knowing which word can follow which.

The software was written in C# and makes heavy use of the LINQ extensions to browse the database.

[Note from 2016 me: While this project is nothing serious, just me screwing around during an afternoon, I did get more serious with natural language processing during my university years. I co-created and published a sentiment analysis engine for the Hungarian language, see paper 10.1515/ausi-2015-0018.]

Output

Well, obviousily, a simple algorithm won't be able to somehow turn into Skynet overnight and master the human language. So, rather than rating the text by correctness, rate it by lulz or epic fails. It's much more entertaining that way. However, don't try to rate the text by cuils. I tried it, the result is $943 \cdot 10^{1024}$ cuils.

Dictionaries

For one-click fun, there are 3 included dictionaries: English, Hungarian and Romanian. For maximum lols, they're all scientifical, so they can generate gems like "The antigravity force ate the tortoise".

If you want to use these dictionaries for other purposes, you're free to use them. They're not encrypted, just compressed with System.IO.Compression.DeflateStream. After decompression deserialize them with System.Runtime.Serialization.Formatters.Binary.BinaryFormatter. (Actually, you can directly provide the DeflateStream to the BinaryFormatter, and save memory and time!)

Markov Chain implementation

This is how I did it. Also you will need these if you want to deserialize the dictionaries.

[Serializable]
public class MarkovChain : Dictionary<string, Word>
{
    public MarkovChain() : base() { }
    public MarkovChain(SerializationInfo info, StreamingContext context) : base(info, context) { }

    public new Word this[string key]
    {
        get
        {
            if (base.ContainsKey(key))
            {
                return base[key];
            }
            else
            {
                return base[key] = new Word(key);
            }
        }
        set
        {
            base[key] = value;
        }
    }
}

[Serializable]
public class Word
{
    public string Value;
    public bool Starts;     // was it seen to start a sentence?
    public bool Ends;       // ...or to end one?
    public List<Word> Next;

    public Word(string value, List<Word> next)
    {
        Value  = value;
        Starts = false;
        Ends   = false;
        Next   = next ?? new List<Word>();
    }

    public Word(string value, bool starts = false, bool ends = false, List<Word> next = null)
    {
        Value  = value;
        Starts = starts;
        Ends   = ends;
        Next   = next ?? new List<Word>();
    }

    public Word GetNext()
    {
        if (Next.Count == 0)
        {
            throw new Exception("Fun's over. Out of words.");
        }

        return Next[Utils.Random.Next(Next.Count)];
    }
}

You will need to use C# 4.0 to compile this code, because of the optional parameters on the second constructor of the Word class. Once you have a dictionary, it's pretty easy to play with it. Pick a sentence starter word from the MarkovChain, then just call the GetNext() on the last Word to get the words. To actually look like a sentence, you will need to implement some grammar functions, and use Starts and Ends properties to determine whether a word can start or end a sentence.

If you make new dictionaries or you succeed in perfecting the text generator make sure to drop me an email! I'd like to see how better can get an algorithm at speaking.

Ideas for perfecting the algorithm

Create a larger and more accurate dictionary.
...which can be achieved by perfecting the text parser at RoliSoft.MarkovChain.Tokenizer.Tokenize()
While spellchecking is not required, grammatical analysis is. You can prehaps use Microsoft Office's API for this?
Create a deeper list, so it won't change the subject with its every word.
Analyze and fix verb tenses. This issue is very annoying, especially in languages like Hungarian.