The following would be a starting point for unsupervised learning. It is an EBNF-style grammar definition which, when parsed with all ambiguities preserved, returns an abstract syntax tree (more precisely, a forest) holding every possible interpretation of the input character stream.
knowledge {
    abstract word (
        letter {@AnyCharacter \ @WhiteSpace},
        next {@word | @Null}
    ) |
    sequence (
        element {
            @word |
            @sequence
        },
        next {(@WhiteSpace, @sequence) | @Null}
    )
}
After parsing, the learner code should analyze the forest and induce basic grammar rules. It should find data, or elements of sets, that repeat periodically in each sentence, in each sequence of words, or within each word (we help the learner by hardcoding whitespace between words; words could also be split into two or more pieces to find morphemes, though that is not shown here). The more data there is to analyze, the more accurate the rules found by induction will be.
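The induction step itself is left open here. As one plausible sketch of "finding repeated elements within each word", counting recurring character n-grams across a word list surfaces morpheme candidates; the function name, parameters, and sample words below are illustrative assumptions, not part of the original design:

```python
from collections import Counter

def repeated_ngrams(words, n, min_count=2):
    """Count every character n-gram across the word list and keep those
    that recur -- candidates for induced sub-word units (morphemes)."""
    counts = Counter(w[i:i + n] for w in words
                     for i in range(len(w) - n + 1))
    return {g: c for g, c in counts.items() if c >= min_count}

words = "walked talked jumped walks talks".split()
print(repeated_ngrams(words, 2))
# the suffix fragment 'ed' recurs 3 times, hinting at a past-tense morpheme
```

With more data, the same frequency counting applied at the word and sequence levels would let the learner promote recurring patterns into candidate grammar rules.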