Saturday, October 19, 2013

Case #1: An anagram server

Requirements: PHP 5.4+ and Apache web server - both installed and running

As stated in our previous post Anagram: Erewhon is nowhere, this weekend we will be talking about two (2) common uses of anagrams in real-world. The first one is the Anagram server. As a reference, in the episode 16 ("Nameless") of "Grimm" 's season 2, Sgt. Wu uses an anagram solver (or server) to find the name of a murderer based on a subject (remember, the original sequence of alphameric characters; some special characters can be allowed in the sequence too ). Yes, it's not only fiction; What we use all depends on our needs, the available resources and the way we want to fill them.
Let's come back to the implementation and see how it works. First of all, what is an anagram server? It's a system that responds to subject-based search of anagrams over internet. How it works? Basically, once the keyed subject is provided, the server starts a matching process and therefore, retrieve the matches (if any). The matching process can be anything of any technical origin. The most common one, which is the one we'll be mounting later, is a search within a dictionary of anagrams. In this scenario, the patterns (the composing characters) of the subject are stored in a regular expression[2] that is later used to do the anagram matching with the words/phrases/expressions in the dictionary. Note that you can insert diversity in your results by adjusting the dictionary space and include more options like the language, the anagram's  topic, ...etc.
The code below is a server (that we call "anaserver") that performs with linear time and space complexities and perform the search for english and french anagrams. In the lines, the result's diversity will be the language. For simplicity measures, we decided to have two(2) different dictionaries: one for english anagrams (data_en.txt) and one for french anagrams (data_fr.txt).According to my dictionary, the result of a subject like "Tom Cruise" will be "I'm so cuter" for english and "Timo sucre" for french. The color marker is the same as last time.

--------------------------------------------------------------------------------------------------------------------------------
anaserver.class.php

<?php
class AnaServer {

    private $language    = "en";              //- to store the language. Default is english
    public  $results    = array();             //- to store the results of the search
    private $allowed    = "/[^a-z0-9]/i"//- we only allow alphameric (a to z and 0 to 9) characterics in our subjects

    //- the constructor of the server
    public function __construct($language){
        $this->language = $language;
    }
   
    //- the matching process during the dictionary search
    public function find($str, $lin){
        $regex = $this->regExp($str);
        $tmp = strtolower(preg_replace($this->allowed,"",trim($lin)));
        if (preg_match($regex, $tmp) && (strlen($str) == strlen($tmp)))
        { array_push($this->results,array(0=>$lin,1=>levenshtein($str, $tmp)));}
    }
   
    //- the main method in charge of the search
    public function __($str) {
      try {
         $str= preg_replace($this->allowed,"",trim($str));
         if (!array_key_exists($this->language, $this->lang)) {

             $lines = @file('data_'.$this->language.'.txt',FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
             foreach ($lines as $line){
                $string = strtolower($str);
                $this->find($string, $line);
            }
         }
         return $this->results;
      }
      catch(Exception $e){
            return array(0 => "Couldn't read the dictionary");
      }
   }
   
  //- to store the pattern of the subject in a regex [see anagram.class.php from the previous post]
   private static function regExp($string){ 
      $regchars = preg_split('//',$string, -1, PREG_SPLIT_NO_EMPTY);
      $regex = "/";
      foreach($regchars as $char) $regex .= "(?=.*".$char.")";
       return $regex."/";
   }

}
?>

--------------------------------------------------------------------------------------------------------------------------------

As you may have noticed, we did use the built-in PHP function "levenshtein()". For those who don't know, the levenshtein() computes the minimal number of characters you have to replace, insert or delete to transform one string into another one. In our server, it is used to compute the edit distance and has been illustrated for those of you who are interested in further manipulation of anagrams. You now see how simple implementing an anagram server is. However to use such server, we will need a client  to instantiate it first and then use the main method "__()" - Same scenario as in the precedent article. To make it more understable via a human-machine interaction, we use a HTLM user interface (index.html) to access the client.

index.html
<!DOCTYPE html>
<html>
<head>
<title> Codijuana - Name/Word to Anagrams </title>
<style>#result1, #result2{padding:0px 5px 5px 5px; background-color:#FF9900; margin-top: 5px;}
</style>
<script type="text/javascript" src="jquery.min.js"></script>
<script>
    <!-- using GET method -->
    function trial(str){if (str==""){document.getElementById("result1").innerHTML="";return;}
    if (window.XMLHttpRequest) { xmlhttp=new XMLHttpRequest();} else{ xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");}
    xmlhttp.onreadystatechange=function(){if (xmlhttp.readyState==4 && xmlhttp.status==200){document.getElementById("result1").innerHTML=xmlhttp.responseText;}}
    xmlhttp.open("GET","client.php?q="+str,true);xmlhttp.send();}
</script>
<script>
    <!-- using POST method -->
    $('.try').live("click",function() {var dico = $("#dico").val();var act = $("#act").val();var tmp = $("#data").val();var data = tmp.trim();var dataString = 'name='+ data + '&action='+act+'&dico='+dico ;
    if(data.length>0){$.ajax({type: "POST",url: "client.php",data: dataString,cache: false,success: function(html){$("#result2").append(html);}});}
    return false;});       
</script>
</head>
<body>
<div><a href="./"><b>Clear the results</b></a> | <a href=""><b>Read the related article</b></a></div>
<div id="trial1">
    <h3>Trial 1: Using the built-in dictionnary</h3>
    <div>Please choose a subject</div>
    <div>
        <select name="name" onchange="trial(this.value)">
           <option value="William Shakespeare">William Shakespeare</option>
           <option value="Chibua Seniores">Chibua Seniores</option>  
           <option value="Tom Cruise">Tom Cruise</option>
           <option value="Anagrams Never Lie...">Anagrams Never Lie...</option>
       </select>
    </div>
    <div id="result1"></div>
</div>

<div id="trial2">
    <h3>Trial 2: Customizing the dictionnary</h3>
    <form action='' method='post'>
        word/name/expression/phrase(max length = 100): <input type="text" maxlength="100" name="txt" id="data" value=""> <br>
        action: <select name="action" id="act"><option value="search">search</option><option value="add">add to the dictionary</option></select>
        language: <select name="dictionnary" id="dico"><option value="fr">french</option><option value="en">english</option></select>
        <input type='submit' value='submit' class='try'>
    </form>
    <div id="result2"></div>
</div>
</body>
</html>

---------------------------------------------------------------------------------------------------------------------------------
Now, here is an example of client.

client.php

<?php
    error_reporting(0);
    require_once("class.anaserver.php");
    $lan = array('english'=>'en', 'french'=>'fr');
    $actions = array(0=>'search', 1=>'add');
   
    $en_matcher = new AnaServer($lan['english']); //- for english anagrams
    $fr_matcher = new AnaServer($lan['french']); //- for french anagrams
   
    //-trial 1
    if(isset($_GET["q"])) {
        $word = $_GET["q"];
        echo "<strong>English</strong><br>";
        $matchs = $en_matcher->__($word);display($matchs);
        echo "<br><strong>French</strong><br>";
        $matchs = $fr_matcher->__($word);display($matchs);
    }
   
    //-trial 2
    $valid = "/^[0-9A-Za-z ,.'-]+$/i";
    if(isset($_POST['name']) && preg_match($valid, $_POST['name']) === 1 && in_array($_POST['action'],$actions) && in_array($_POST['dico'],$lan)) {
        $word = $_POST['name'];
        $a = $_POST['action'];
        $d = $_POST['dico'];
        if($a == "search"){
          $matchs = $d=="en"?$en_matcher->__($word):$fr_matcher->__($word);
          display($matchs);
        }
        else {
            $dictionnary = 'data_'.$d.'.txt';
            $dico_words = file($dictionnary);
            if(!in_array($word, $dico_words)){
                $word = "\n".$word;
                $fh = fopen($dictionnary, 'a') or die();
                if(fwrite($fh, $word)) echo "<br>Done! Thanks for your contribution"; else die("bad!");
                fclose($fh);
            }
            else echo "<br>This word/name already exists in our ".array_search ($d, $lan)." dictionnary: Try another word";
        }
    }
       
    function display($entries){
      if(count($entries)>0) {
        foreach($entries as $e) echo $e[0]." - edit distance:".$e[1]."<br>";
      }
      else echo "No match was found<br>";
    }
?>

View the results|Download source

Extended work
Everything said, anagrams can be really helpful if someone know how to use them. Nonetheless, their reliability highly depends on how rich is your dictionary. And furthermore, even if your dictionary becomes richer (that is bigger), the time performance of the anagram server will gradually tumble with time. A deeper consideration can overcome this issue by including the machine learning technology in the matching process to create a smarter and faster server. Why and how?
Well, why because machine learning can learn from data, and thus from the input subject. How? Instead of using a dictionary as the essence of the matching process, we use build a subject-based model (by training it) and suggest anagrams from this model. Such training can be sourced from the existing  anagram-solving techniques [1],[2]. This might be certainly consuming. Lucky us, Google and NASA are working on a quantum technology to advance machine learning [5]. However, this machine learning approach of serving anagrams to the client hasn't been tried yet and need further attention.

The next case will be about implementing another use of anagram, precisely the permutation cipher. Meantime, have a great day! And don't forget to comment and share this article, add me on G+ and subscribe to the blog's RSS feeds

References:
1) http://www.word-buff.com/solving-anagrams.html
2) http://archive.aweber.com/wordbuff/Oncns/h/5_Cunning_Tips_for_Solving.htm
3) http://en.wikipedia.org/wiki/Regular_expression
4) http://words.jspell.com/anagrams-of/
5) http://www.independent.co.uk/life-style/gadgets-and-tech/news/video-a-first-look-inside-google--nasas-quantum-computing-lab-8879787.html

No comments:

Post a Comment