Try the codes or catch the exceptions: October 2013

Sunday, October 27, 2013

Case #2: The alpha-padding permutation cipher

Requirements: PHP 5.4+ and Apache web server - both installed and running
Especially for this article, references will be noted by letters instead of numbers.

For this week, I'll be using a different approach of explaining the algorithm that I refer as "understanding in the storm". This will make use of special references within the codes. References will be made to some comments Also after reading some articles on the permutation cipher, I got inspired and developed an enhanced version of this last. Indeed the traditional permutation cipher was shown to be vulnerable to brute force attacks and frequency analysis.

First of all, let me quickly explain how permutation ciphers work (You can skip it though).

1.The traditional permutation cipher
We choose a keyword, and split the plaintext (the text to encrypt) into blocks that are the same length as the keyword. We write this in columns beneath the keyword( It can be represented as arrays in programming). We then label each keyword letter in alphabetical order (if there are duplicates we take them in order of appearance). Now we reorder the columns, so that the numbers are in order (the letters of the keyword are in alphabetical order). We now read across the rows[a].
For instance, suppose you want to process this plaintext using the key beac: kanel is awesome

The two tables show how it goes using the procedure described above.

*b e a c* 2 5 1 3 k a n e l i s a w e s o m e	====>	*a b c e* 1 2 3 5 n k e a i l s w e a m s e o
The original plaintext broken into blocks having the same lenght than the key and sorted in labeled columns		The blocks ordered after sorting the key's digits

Therefore, after encrypting, the result (the ciphertext) will be: nkeails w eamseo. You see now the relation with anagrams. The plaintext and the ciphertext are anagrams in permutation ciphers.
However, this technique has a lot of inconviniences:
(a) The information relating to the length of the key is revealed by the length of the ciphertext
(b) Such permutation is breakable by trying out different permutations until you find a first block that makes sense. Once you have worked out the permutation, you can then use this to decrypt the rest of the intercept easily. That's brute force attack.
(c) Another way to break it is by frequency analysis. Basically, you consider some specific commonly used letters of the permutation's language (english, french,...) that are parts of the ciphertext, then you study their frequency [b] in order to determine which one is more likely in the plaintext and thus, infer the plaintext.

Looking at these, I said to myself: "With such flaws, such cipher is definitely useless". So I started working on a more mature useful permutation-type cipher that I called: the alpha-padding permutation cipher .

2.The alpha-padding permutation cipher

Inspired from the traditional transposition cipher, the essence of this encryption method resides in three(3) main pieces: an alphabet, a padding parameter and the key.

2.1 How it works:
First, you choose a key and a padding parameter in line with the following requirements:
- Both must be numbers
- Both must be contain at least two(2) digits
Now during the encryption, as its predecessor, the original plaintext is broken into blocks having the same length than the key and sorted in labelled columns. But this time, before the blocks are ordered after sorting the key's digits, each character of the plaintext-side content of the columns is mapped to a number based on an alphabet and its position in the alphabet; and the position is then padded using the padding parameter. A child parameter called subpad, based on the key and the padding parameter, is computed and debited substracted from the total. The subpad is the cumulative join of the padding parameter's half within the length of the key.

Example:
Consider again the same plaintext above: kanel is awesome but now for the key, we will take 2513 and for the padding parameter, we choose 10 for instance. We define our alphabet as follows: "00"=>" ","01"=>"a","02"=>"b",
"03"=>"c","04"=>"d","05"=>"e","06"=>"f", "07"=>"g","08"=>"h","09"=>"i",
"10"=>"j","11"=>"k","12"=>"l","13"=>"m","14"=>"n","15"=>"o","16"=>"p",
"17"=>"q","18"=>"r","19"=>"s","20"=>"t","21"=>"u","22"=>"v","23"=>"w",
"24"=>"x","25"=>"y","26"=>"z","27"=>"_");

2 5 1 3 k a n e l i s a w e s o m e	=>	2 5 1 3 11 01 14 05 12 00 09 19 00 01 23 05 19 15 13 05	=>	2 5 1 3 21 11 24 15 22 10 19 29 10 21 33 15 29 25 23 15	=>	1 2 3 5 24 21 15 11 19 22 29 10 33 10 15 11 23 29 15 25
The original plaintext broken into blocks having the same lenght than the key and sorted in labeled columns		Mapping each block's characters to the alphabet		Padding/Adding a noise to the map by adding 10 to the content of each block		The blocks ordered after sorting the key's digits

The resulting ciphertext will hence be: 24211511192229103310151123291525
Finally, the subpad (5555) is substracted from the resulting ciphertext and the new ciphertext is: 24211511192229103310151123285970

In general, one can say that the alpha-padding permutation cipher is a vertical padded sort followed by a horizontal substraction.

For the decryption process, we simply first add the subpad and split the ciphertext into block of lenght 2. Next, for each block, we substract the padding paramater from each block's numbers and map the result to their corresponding alphabet's character. Then we form super-block of length = key's length, index each super-block with the sorted key's digit. Finally, we bring the plaintext back by bringing the indexes(key's digits) to the original order of the key. Without knowledge of the key and the padding parameter, it's almost impossible to crack this cipher.

Pretty simple and powerful; but strict though. As a matter of fact, apart from the initial requirements, two more constraints are added: Because the key's digits serve as array indexes, the key shall not contain repetitive digits (e.g: 1000, 1199,....). Furthermore, how good are the ciphered plaintext and deciphered ciphertext depends on how good is your alphabet. So you might want to revise it.

2.2 The code:
Now, let's take a look at the class behind the alpha-padding process:

class.alphapad.php

<?php

class alphaPad {

   private $key = 0;
   private $padding_par = 0;
   private $alphabet = array ("00"=>" ","01"=>"a","02"=>"b","03"=>"c",
   "04"=>"d","05"=>"e","06"=>"f","07"=>"g","08"=>"h", "09"=>"i",
   "10"=>"j","11"=>"k","12"=>"l","13"=>"m","14"=>"n","15"=>"o",
   "16"=>"p","17"=>"q","18"=>"r","19"=>"s","20"=>"t","21"=>"u",
   "22"=>"v","23"=>"w","24"=>"x","25"=>"y","26"=>"z","27"=>"_");
   private $sub_par = 0;
   private $length = 0;
   private $plaintext = "";
   private $ciphertext = "";
   private $key_a = array();

   public function __construct($key, $padding_par){
       $this->key = $key;
       $this->padding_par = $padding_par;
       $this->length = strlen($this->key);
       $this->key_a = str_split($this->key);
        $this->sub_par = $this-> _sp($this->padding_par, $this->length);
   }

   public function _encrypt($plaintext){
       try{
           $this->ciphertext = "";
           $this->plaintext = $plaintext;
          print "<div id='out'>Plaintext: ".$this->plaintext." <br>(key: ".$this->key." - padding parameter: ".$this->padding_par.")";

           //- split the plaintext into block of lenght = key lenght [1]
           //- index each block using the key [2]
           //- and sort each block by the key's index [3]
           $plain_a = str_split($this->plaintext, $this->length); /*[1]*/

           foreach ($plain_a as $block) { /*[2]*/
               //- if a block is not a multiple of $lenght. e.g: "plai ntex t" for $lenght=4
               $empty_space=strlen($block)% $this->length;
               if( $empty_space > 0) {for($j=1; $j<=($this->length-$empty_space); $j++) {$block .="-";}}

               $block_a = str_split($block);
               $tmp = array(); //- we put all the changes in a temp array first
               for($i = 0; $i < $this->length; $i++){
                   //- $tmp[$this->key_a[$i]] = $block_a[$i]; - For the primitive permutation
               $tmp[$this->key_a[$i]] = (int)array_search($block_a[$i], $this->alphabet)+$this->padding_par; /*[2]*/
                   //- unmatchable index keys will be converted to 0
                   unset($block_a[$i]);
               }
               $block_a = $tmp; unset($tmp);
               ksort($block_a); /*[3]*/
               $this->ciphertext .= implode("",$block_a);
           }
$this->ciphertext = $this->ciphertext - $this->sub_par;
           print "<br>Ciphertext: ".number_format($this->ciphertext, 0, '.', '');
       }
       catch(Exception $e){
            die ("An error occured");
        }

   }

   private function alpha(&$item, $key, $parameter){
       $item = (int)$item;
       $item = $item - $parameter;
       if(strlen($item) == 1) $item ="0".$item;
       if (isset($this->alphabet[$item])) $item = $this->alphabet[$item];
       else $item = "-";
   }

   public function _decrypt($ciphertext){
       try{
           $this->plaintext = "";
           $this->ciphertext = $ciphertext;
            $this->ciphertext = $this->ciphertext +$this->sub_par;
           print "<div id='out'>Ciphertext: ".$this->ciphertext." <br>(key: ".$this->key." - padding parameter: ".$this->padding_par.")";

           //- split the ciphertext into block of lenght = string lenght of the items in $alphabet = 2 [1]
           //- for each block, unpad and map to the alphabet character [2]
           //- form super-block of lenght = key lenght [3]
           //- sort the key in ascending order [4]
           //- index each super-block with the sorted key [5]
           //- bring the plaintext back by bringing the indexes(key digits) to the original order of the key. [6]

           $cipher_a = str_split($this->ciphertext, 2); /*[1]*/

           if($this->length%count($this->ciphertext)==0) {

               array_walk( $cipher_a , array($this,'alpha'), $this->padding_par); /*[2]*/
               $cipher_a = array_chunk( $cipher_a, $this->length); /*[3]*/
               asort($this->key_a); /*[4]*/

               foreach($cipher_a as $block_a){
                   $tmp = array(); $i= 0;
                   foreach($this->key_a as $k=>$v){ /*[5]*/
                       if(isset($block_a[$i])) {
                           $tmp[$v] = $block_a[$i];
                           unset($block_a[$i]);
                       }
                       else $tmp[$v] = "-"; //- e.g: 25 31 21 --
                       $i++;
                   }
                   $block_a = $tmp; unset($tmp);
                   $block_a = $this->referential_ksort($block_a,$this->key); /*[6]*/
                   $this->plaintext .= implode("", $block_a);
               }
               print "<br>Plaintext: ".$this->plaintext;

           }
           else echo "Invalid ciphertext";
       }
       catch(Exception $e){
            die ("An error occured");
        }
   }

   private function referential_ksort($array, $sequence){
       //- sorting by array index and also by following a given sequence
       //- thus all the array keys must be members of the given sequence
       $sequence_a = str_split($sequence);$temp = array();
       foreach ($sequence_a as $s) $temp[$s] = $array[$s];
       unset($array);
       return $temp;
   }

   private function _sp($pad,$len){
       $sp = "";
       for($i=0; $i<$len; $i++) $sp.=(int)($pad/2); //- so that the substraction will always be > 0
       return $sp;
   }
}

2.3 Complexities (Time and Space):

Let l be length of the key, n the length of the original input array
We note n/l as the round up of n/l

For the encryption:
#Time complexity: O(l^n/l)
In the method _encrypt(), during the "for" loop [2], n/l blocks (from the initial block division) are traversed. Then in each loop, we have another loop of l rounds giving a total of l^n/l time units. The computation of the $sub_pad requires l time units because of the "for" loop involved in the process of the method _sp(). However it's not included in the time complexity since it's computed at the same time the class is instantiated.
#Space complexity: O(l+n)
Again in the method _encrypt(), the computation of $key_a requires l space units, all the $block_a/$plain_a instances after splitting the plaintext takes n space units, the transition from $block_a to $tmp and the other fixed variables take only one (1) unit space.

For the decryption:
#Time complexity: O(n/l+(2l)^n/l)
In the quest of applying the method alpha(), the method _decrypt() will take a single time unit to traverse each of the n/l blocks. Similarly to _encrypt, the super 'for' loop will traverse each of the n/l block via another loop of l rounds; but this time, we have a game changer called referential_ksort() in each the l-rounds loop causing a total consumption of 2l^n/l time units for this 2-dimensional traversal.
#Space complexity: O(2l+n)
As in encrypt(), $key_a costs l space units, all the $block_a/$cipher_a take n space units. However the variable $sequence_a in referential_ksort in the 2-dimensional loop will cost an additional l. And we have the transition from $block_a to $tmp and the other fixed variable demanding a single space unit.

2.4 The results:
The test of its code was done in a regular Apache-PHP environment, nothing special) and shows encouraging results

View the results|Download source

Memory overhead: Given a specific cipher, we define the memory overhead as follow:
overhead = (size_of_plaintext / size_of_ciphertext)-1
In the experiment, the following function was in charge of capturing the memory footprint during the ciphering:

function sizeofvar($var) {
   $start_memory = memory_get_usage();
   $tmp = unserialize(serialize($var));
   return memory_get_usage() - $start_memory;
}

Note that this is in no way a reliable method, you can't be sure that nothing else touches memory while assigning the variable, so this should only be used as an approximation. The figure bellow shows the results:

   Loading ...

Execution time: To efficiently test the running time of the cipher over plaintexts of random length, we measured 5 times the running time over 5 different values of the padding parameter (10,47,84,135,206,313) and over 5 different values of the key (12,98,105,376,1010). For each round, we increased the length of the input plaintext. The results are plotted in figure below.

   Loading ...

False positives: By false positives, we mean the scenario where the encryption always works but decryption not. Indeed, during the experimental phase, we found that false positives occur when the key's length is 1 (one) and the value of the padding parameter is 79 and above. That's why we suggest again to use keys containing at least 2 (two) digits.

2.5 Some practices in real-world:

#Name-based URLs: The detailed technique in this article can be used to match a username to a specific code. For instance, in a social network or any other dynamic computer system where user's data are stored in a XML file named using a user-based code (such code can be created by passing the user name to our alpha-padding permutation cipher, you might have a scenario where the user "johndoe" request to see the data of "smith" via the URL: http://www.example.com/smith. This request will load the file 1929302324181210-data.xml (1929302324181210 is the ciphertext for "smith" for the key 2513 and the padding parameter 10).

#Digital signatures: Another practical example of our cipher is found in the identity management area. It can be used to protect files using a collaboration with trusted time-stamping. This scenario will be the topic of our next article. The result of such collaboration will be called Private Trusted Identity. In this case, a document is created by someone called "John Doe Sr.". After signing the document with something he is (his name/social security number/fingerprint/...), the saving procedure will used his name to create a code using our alpha-padding permutation cipher. This code will be then joined to a time-stamp provided by a time-stamp authority to form a private key or digital signature. The document, the private key and the time-stamp will be rowed in a database. During the access to the document by another bud, the digital signature will be verified first before the green light is given.

Extended work:
A deeper research on the alpha-padding permutation cipher includes, but is not limited to:
- The sub-pad parameter (the variable $sub_pad variable in the code): The resulting ciphertext can be really large for some plaintext and therefore might not be supported by some PHP environment. In our presented study, the sub-pad parameter cannot not be longer than the key. Thus, future works will aim to provide a larger sub-pad parameter in order to considerably reduce the length of the ciphertext and hence the inference power of the attacker.
- Another vista to pay attention is the customization of the alphabet. A more diverse and mature alphabet will give a bigger input field from the user's perspective. However such tailoring will require a conscious adjustment of the cipher.

References:
(a) http://crypto.interactive-maths.com/permutation-cipher.html
(b) http://homepages.math.uic.edu/~leon/mcs425-s08/handouts/breaking_tranposition_cipher.pdf
(c) http://web.cs.du.edu/~ramki/courses/security/2011Winter/notes/Lecture4.pdf
(d) http://everything2.com/title/Permutation+cipher
(e) http://stackoverflow.com/questions/2192657/how-to-determine-the-memory-footprint-size-of-a-variable

Thanks for your time and don't forget to comment and share this article, add me on G+ and subscribe to the blog's RSS feeds

Saturday, October 19, 2013

Case #1: An anagram server

Requirements: PHP 5.4+ and Apache web server - both installed and running

As stated in our previous post Anagram: Erewhon is nowhere, this weekend we will be talking about two (2) common uses of anagrams in real-world. The first one is the Anagram server. As a reference, in the episode 16 ("Nameless") of "Grimm" 's season 2, Sgt. Wu uses an anagram solver (or server) to find the name of a murderer based on a subject (remember, the original sequence of alphameric characters; some special characters can be allowed in the sequence too ). Yes, it's not only fiction; What we use all depends on our needs, the available resources and the way we want to fill them.
Let's come back to the implementation and see how it works. First of all, what is an anagram server? It's a system that responds to subject-based search of anagrams over internet. How it works? Basically, once the keyed subject is provided, the server starts a matching process and therefore, retrieve the matches (if any). The matching process can be anything of any technical origin. The most common one, which is the one we'll be mounting later, is a search within a dictionary of anagrams. In this scenario, the patterns (the composing characters) of the subject are stored in a regular expression[2] that is later used to do the anagram matching with the words/phrases/expressions in the dictionary. Note that you can insert diversity in your results by adjusting the dictionary space and include more options like the language, the anagram's topic, ...etc.
The code below is a server (that we call "anaserver") that performs with linear time and space complexities and perform the search for english and french anagrams. In the lines, the result's diversity will be the language. For simplicity measures, we decided to have two(2) different dictionaries: one for english anagrams (data_en.txt) and one for french anagrams (data_fr.txt).According to my dictionary, the result of a subject like "Tom Cruise" will be "I'm so cuter" for english and "Timo sucre" for french. The color marker is the same as last time.

--------------------------------------------------------------------------------------------------------------------------------
anaserver.class.php

<?php
class AnaServer {

    private $language    = "en";              //- to store the language. Default is english
public $results    = array();             //- to store the results of the search
    private $allowed    = "/[^a-z0-9]/i"; //- we only allow alphameric (a to z and 0 to 9) characterics in our subjects

    //- the constructor of the server
    public function __construct($language){
        $this->language = $language;
    }

    //- the matching process during the dictionary search
    public function find($str, $lin){
        $regex = $this->regExp($str);
        $tmp = strtolower(preg_replace($this->allowed,"",trim($lin)));
        if (preg_match($regex, $tmp) && (strlen($str) == strlen($tmp)))
        { array_push($this->results,array(0=>$lin,1=>levenshtein($str, $tmp)));}
    }

    //- the main method in charge of the search
    public function __($str) {
try {
       $str= preg_replace($this->allowed,"",trim($str));
         if (!array_key_exists($this->language, $this->lang)) {

           $lines = @file('data_'.$this->language.'.txt',FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
           foreach ($lines as $line){
                $string = strtolower($str);
                $this->find($string, $line);
            }
         }
         return $this->results;
      }
    catch(Exception $e){
            return array(0 => "Couldn't read the dictionary");
    }
   }

//- to store the pattern of the subject in a regex [see anagram.class.php from the previous post]
   private static function regExp($string){
    $regchars = preg_split('//',$string, -1, PREG_SPLIT_NO_EMPTY);
    $regex = "/";
    foreach($regchars as $char) $regex .= "(?=.*".$char.")";
       return $regex."/";
   }

}
?>

--------------------------------------------------------------------------------------------------------------------------------

As you may have noticed, we did use the built-in PHP function "levenshtein()". For those who don't know, the levenshtein() computes the minimal number of characters you have to replace, insert or delete to transform one string into another one. In our server, it is used to compute the edit distance and has been illustrated for those of you who are interested in further manipulation of anagrams. You now see how simple implementing an anagram server is. However to use such server, we will need a client to instantiate it first and then use the main method "__()" - Same scenario as in the precedent article. To make it more understable via a human-machine interaction, we use a HTLM user interface (index.html) to access the client.

index.html
<!DOCTYPE html>
<html>
<head>
<title> Codijuana - Name/Word to Anagrams </title>
<style>#result1, #result2{padding:0px 5px 5px 5px; background-color:#FF9900; margin-top: 5px;}
</style>
<script type="text/javascript" src="jquery.min.js"></script>
<script>
    
    function trial(str){if (str==""){document.getElementById("result1").innerHTML="";return;}
    if (window.XMLHttpRequest) { xmlhttp=new XMLHttpRequest();} else{ xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");}
    xmlhttp.onreadystatechange=function(){if (xmlhttp.readyState==4 && xmlhttp.status==200){document.getElementById("result1").innerHTML=xmlhttp.responseText;}}
    xmlhttp.open("GET","client.php?q="+str,true);xmlhttp.send();}
</script>
<script>
    
    $('.try').live("click",function() {var dico = $("#dico").val();var act = $("#act").val();var tmp = $("#data").val();var data = tmp.trim();var dataString = 'name='+ data + '&action='+act+'&dico='+dico ;
    if(data.length>0){$.ajax({type: "POST",url: "client.php",data: dataString,cache: false,success: function(html){$("#result2").append(html);}});}
    return false;});
</script>
</head>
<body>
<div><a href="./"><b>Clear the results</b></a> | <a href=""><b>Read the related article</b></a></div>
<div id="trial1">
    <h3>Trial 1: Using the built-in dictionnary</h3>
    <div>Please choose a subject</div>
    <div>
        <select name="name" onchange="trial(this.value)">
           <option value="William Shakespeare">William Shakespeare</option>
           <option value="Chibua Seniores">Chibua Seniores</option>
           <option value="Tom Cruise">Tom Cruise</option>
           <option value="Anagrams Never Lie...">Anagrams Never Lie...</option>
       </select>
    </div>
    <div id="result1"></div>
</div>

<div id="trial2">
    <h3>Trial 2: Customizing the dictionnary</h3>
    <form action='' method='post'>
        word/name/expression/phrase(max length = 100): <input type="text" maxlength="100" name="txt" id="data" value=""> <br>
        action: <select name="action" id="act"><option value="search">search</option><option value="add">add to the dictionary</option></select>
        language: <select name="dictionnary" id="dico"><option value="fr">french</option><option value="en">english</option></select>
        <input type='submit' value='submit' class='try'>
    </form>
    <div id="result2"></div>
</div>
</body>
</html>

---------------------------------------------------------------------------------------------------------------------------------
Now, here is an example of client.

client.php

<?php
    error_reporting(0);
    require_once("class.anaserver.php");
    $lan = array('english'=>'en', 'french'=>'fr');
    $actions = array(0=>'search', 1=>'add');

    $en_matcher = new AnaServer($lan['english']); //- for english anagrams
    $fr_matcher = new AnaServer($lan['french']); //- for french anagrams

    //-trial 1
    if(isset($_GET["q"])) {
        $word = $_GET["q"];
        echo "<strong>English</strong><br>";
        $matchs = $en_matcher->__($word);display($matchs);
        echo "<br><strong>French</strong><br>";
        $matchs = $fr_matcher->__($word);display($matchs);
    }

    //-trial 2
    $valid = "/^[0-9A-Za-z ,.'-]+$/i";
    if(isset($_POST['name']) && preg_match($valid, $_POST['name']) === 1 && in_array($_POST['action'],$actions) && in_array($_POST['dico'],$lan)) {
        $word = $_POST['name'];
        $a = $_POST['action'];
        $d = $_POST['dico'];
        if($a == "search"){
        $matchs = $d=="en"?$en_matcher->__($word):$fr_matcher->__($word);
        display($matchs);
        }
        else {
            $dictionnary = 'data_'.$d.'.txt';
            $dico_words = file($dictionnary);
            if(!in_array($word, $dico_words)){
                $word = "\n".$word;
                $fh = fopen($dictionnary, 'a') or die();
                if(fwrite($fh, $word)) echo "<br>Done! Thanks for your contribution"; else die("bad!");
                fclose($fh);
            }
            else echo "<br>This word/name already exists in our ".array_search ($d, $lan)." dictionnary: Try another word";
        }
    }

    function display($entries){
    if(count($entries)>0) {
        foreach($entries as $e) echo $e[0]." - edit distance:".$e[1]."<br>";
    }
    else echo "No match was found<br>";
    }
?>

View the results|Download source

Extended work
Everything said, anagrams can be really helpful if someone know how to use them. Nonetheless, their reliability highly depends on how rich is your dictionary. And furthermore, even if your dictionary becomes richer (that is bigger), the time performance of the anagram server will gradually tumble with time. A deeper consideration can overcome this issue by including the machine learning technology in the matching process to create a smarter and faster server. Why and how?
Well, why because machine learning can learn from data, and thus from the input subject. How? Instead of using a dictionary as the essence of the matching process, we use build a subject-based model (by training it) and suggest anagrams from this model. Such training can be sourced from the existing anagram-solving techniques [1],[2]. This might be certainly consuming. Lucky us, Google and NASA are working on a quantum technology to advance machine learning [5]. However, this machine learning approach of serving anagrams to the client hasn't been tried yet and need further attention.

The next case will be about implementing another use of anagram, precisely the permutation cipher. Meantime, have a great day! And don't forget to comment and share this article, add me on G+ and subscribe to the blog's RSS feeds

References:
1) http://www.word-buff.com/solving-anagrams.html
2) http://archive.aweber.com/wordbuff/Oncns/h/5_Cunning_Tips_for_Solving.htm
3) http://en.wikipedia.org/wiki/Regular_expression
4) http://words.jspell.com/anagrams-of/
5) http://www.independent.co.uk/life-style/gadgets-and-tech/news/video-a-first-look-inside-google--nasas-quantum-computing-lab-8879787.html

Sunday, October 13, 2013

Anagram: Erewhon is nowhere

Requirements: PHP 5.4+ and Apache web server - both installed and running

In this article, we will learn how to write a code to find anagrams. As a quick introduction, anagrams are the results of a specific word's reordering. For instance, "erewhon" and "nowhere" are anagrams. The original word or phrase is known as the subject of the anagram[1].
After showing their basic implementation in PHP, we'll take a look at their real-world effects and implement some of them.
To start, create a folder in the root directory called "anagram" (you can call it whatever you want). This is where all the files created throughout this tutorial must be saved.
Then inside that folder, create an input text file called "test.txt" (you can call it whatever you want). This is the test containing the anagrams to evaluate and sort. For this tutorial, we'll be using the following sample lines:

pot,pot,opt,godi,dogi,cat,tac
pro,tro,franc,ranfc,rop,water,treaw
root,toor,tro,ort,lamp,foo,malp
kindle,xbox,playstation,gamecube,meucagbe,boxx,ledink,linked,stationplay
book,cook,hook,took,obok,ocok,kooh,off

Insert those lines in the text file "test.txt". Our program will traverse that file per line, and for each line, it will group anagrams together. The file in the program in charge of that job will be called "anagram.class.php" and will be instantiated in the file "index.php" to process the file "test.txt". For readability performance, we use the same color marker as geany.
------------------------------------------------------------------------------------------------------------------------------
anagram.class.php

<?php
namespace Anagram
{
   class Program
   {
     public function main()
    {
       try
       {
        $file = 'test.txt';
        $lines = @file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        foreach ($lines as $line)
            $this->cluster($line);
       }
       catch(Exception $e){
         echo "Couldn't read the text file";
        echo "Caught exception: ", $e->getMessage(),"\n";
       }

    }

    public function cluster($line)
    {
    $regexs = array();   //- to store the regular expressions(regex) or matching patterns of the subjects
    $a_group = array(); //- to store the set of anagrams

    echo "Initial input: ".$line."<br>";
    $words = explode(",", $line);
    usort($words, array('Anagram\Program','cmp')); //- we 1st sort the input array by increasing string length to facilitate the search

    //- Initial values of the regex set and the anagram set
    $pattern = $words[0];
      $regexs[$pattern] = $this -> regExp($pattern);
    $a_group[$pattern] = $pattern;

        for ($i = 1; $i<count($words); $i++) {

            $found = 0;

            foreach ($regexs as $key =>$regex){

//- we simply look through the existing set of regular expressions to see if the current value match one of those. If yes, we add it to the corresponding anagram set
                if (preg_match($regex, $words[$i]) && (strlen($key) == strlen($words[$i])))
                        { $a_group[$key] .= ", ".$words[$i]; $found = 1; }
            }

//- After the above traversal, if there is no match with existing set of regular expressions (thus with the existing set of anagrams), then we add the current value's regex to the set of regexs and to the set of anagrams
            if($found == 0) {
                 $pattern = $words[$i];
                  $tmp = $this -> regExp($words[$i]);
                if(!in_array($tmp, $regexs)) $regexs[$pattern] = $tmp;
                $a_group[$pattern] = $pattern;
            }

        }
        //- Finally, we display the results
       foreach ($a_group as $k => $anagram_set)
        echo "<u>Anagrams of ".$k."</u>: ".$anagram_set."<br>";
        echo "<br>";
    }

    private static function cmp($a, $b){
       return strlen($b)- strlen($a);
    }

    private static function regExp($string){
       $regchars = preg_split('//', $string, -1, PREG_SPLIT_NO_EMPTY); //- we split the pattern to find its corresponding matching patterns
       $regex = "/";
       foreach($regchars as $char) $regex .= "(?=.*".$char.")";
       return $regex."/";
    }
   }
}

?>

index.php

Starting from the bottom

Hello all,

Welcome to my blog. Codijuana is my 2^nd blog and will be more nerdy than its predecessor, The Feeling Brainstorm (for my evasive poetical thoughts - And between you and me, the article Let me dream again is the best. You should check it out).
The originality of this blog (Codijuana) is that I will be giving a sequence of illustrations based on a “started from the bottom now we are here” approach - lol you might ask yourself “wtf, drake?”. Essentially, the point of the concept is to start from a basic concept and expand it to something tremendous. I aim to show how basic codes you write during your coding classes or simple server configuration presented in lab sessions can become a big steak wrapped in a double cheese burger.

Limitations: Since my expertise area is the web, I won't have the resources to cover the software programming and thus will be pay focusing on illustrations and examples within the web for development and server configurations.

The resources include, but are not limited to

- Programming: PHP, MySQL (queries), JavaScript, XML, Ajax, HTML, JQuery library, Perl

- Server configuration: Apache, MySQL, Ubuntu, LAMP

- Expansion of the initial basic knowledge: Wikipedia, Google

Please frequently check this list for updates. And note that the more deeper we will go, the more complex the used tools will become. So make sure to follow every single step.

List of the basic concepts covered throughout this blog:

+ Anagram: Erewhon is nowhere ★★★★★

+ Web notification monitoring: The Push-Pull game ☆ ☆ ☆ ☆ ☆

+ Machine learning: 2D pattern extraction from a raw text ☆ ☆ ☆ ☆ ☆

+ Ubuntu LAMP: SSS (Servers Settings & Security) prior and post deploying a website ☆ ☆ ☆ ☆ ☆

Please frequently check this list for updates.

NOTE: The software license under which I'm making the codes available is the Open Source license; which means redistribution of the codes is allowed.

Don't forget to comment and share this article, add me on G+ and subscribe to the blog's RSS feeds. See you in the next post.