9/30/2015

Find duplicate PDF articles in a list of 14,000 by comparing their cosine similarity

Hi,

this is a Python script used to test the similarity between text files.

The 14,000 text files used in this test were generated from published medical articles in PDF. The articles were converted to text files with the pdftotext utility and a bash script on RHEL7; a sketch of that conversion step is shown below.
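
Something along these lines is what the conversion looked like. This is a minimal sketch, not my exact bash script: it is written here in Python around the same pdftotext utility, and the directory paths are placeholders.

 import os
 import subprocess

 pdf_dir = "/data/rawpdfs"   # hypothetical source directory of PDF articles
 txt_dir = "/data/pdfs"      # hypothetical target directory for the text files
 for name in os.listdir(pdf_dir):
   if name.lower().endswith(".pdf"):
     src = os.path.join(pdf_dir, name)
     dst = os.path.join(txt_dir, os.path.splitext(name)[0] + ".txt")
     # pdftotext writes the extracted text of src into dst
     subprocess.call(["pdftotext", src, dst])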

Basically, what the script does is:


  1. Load the list of text files into an array from a text file. (The files are ordered by size in the list: an article of 1 KB cannot have content similar to that of an article of 5 MB.) 
  2. For each text file, extract its content into a variable file1_content. 
  3. For each text file around the one extracted previously (around: the variable range_sim_size selects only the adjacent entries in the list, i.e. files with a slightly larger or smaller size): 
    1. Extract its content into a variable file2_content. 
    2. Compute the cosine similarity with the scikit-learn module (a minimal, isolated sketch of this step is shown right after this list). 
    3. If the cosine is greater than 0.95, we can be pretty confident that the two articles are the same, and we print both file names and their cosine similarity value. 
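
To make step 3.2 concrete before the full script, here is a self-contained sketch on two toy strings (the strings are made up for illustration):

 from sklearn.feature_extraction.text import TfidfVectorizer

 doc1 = "aspirin reduces the risk of stroke in adults"
 doc2 = "the risk of stroke in adults is reduced by aspirin"
 # fit_transform returns one L2-normalised TF-IDF row vector per document,
 # so the dot product of the two rows is directly their cosine similarity.
 tfidf = TfidfVectorizer(min_df=1).fit_transform([doc1, doc2])
 print (tfidf * tfidf.T).A[0, 1]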



 from sklearn.feature_extraction.text import TfidfVectorizer

 def read_file(path):
   """Read a file and decode it, trying the most likely encodings first."""
   with open(path, 'r') as f:
     content = f.read()
   for encoding in ('utf-8', 'cp1252'):
     try:
       return content.decode(encoding)
     except UnicodeDecodeError:
       continue
   # latin-1 maps every byte, so this final fallback cannot fail
   return content.decode('latin-1')

 # Load the list of filenames (first column of the tab-separated list file).
 # The list is ordered by file size, so similar-sized files are neighbours.
 filenames = []
 with open("C:\\Temp\\pdfsOrdered.txt", "r") as ins:
   for line in ins:
     filenames.append("C:\\Temp\\pdfs\\" + line.split("\t")[0])

 range_sim_size = 100  # width of the size window around the current file

 for fi in range(len(filenames)):
   file1_content = read_file(filenames[fi])
   if len(file1_content) == 0:
     continue
   # Only compare against neighbours in the size-ordered list
   lo = max(0, fi - range_sim_size // 2)
   hi = min(len(filenames), fi + range_sim_size // 2)
   for fo in range(lo, hi):
     if fo == fi:
       continue
     file2_content = read_file(filenames[fo])
     if len(file2_content) == 0:
       continue
     vect = TfidfVectorizer(min_df=1)
     try:
       tfidf = vect.fit_transform([file1_content, file2_content])
       # Rows are L2-normalised, so the off-diagonal entry of
       # tfidf * tfidf.T is the cosine similarity of the two files
       cosine = (tfidf * tfidf.T).A[0, 1]
       if cosine > 0.95:
         print str(cosine) + "\t" + filenames[fi] + "\t" + filenames[fo]
     except ValueError:
       # fit_transform fails when a file yields no usable terms
       pass

The file containing the list of text files has the filename in the first column and the file size in bytes in the second column. You can find part of the file as an example here (a sketch that generates such a list follows the excerpt):

3004.txt 2876089
3002.txt 2731653
2826.txt 2496024
1499.txt 2217977
1832.txt 1893984
1833.txt 1893984
4182.txt 1598168
4183.txt 1598168
7456.txt 1549750
7457.txt 1549750
2378.txt 1537610
1150.txt 1468884
3005.txt 1399347
3003.txt 1330662
2829.txt 1256621
2824.txt 1232311
1830.txt 1135066
1831.txt 1135066
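
A minimal sketch of how such a list can be generated, assuming the text files already sit in C:\Temp\pdfs (I actually produced mine on the Linux side; the sort order and output format are the ones the script above expects):

 import os

 # Regenerate pdfsOrdered.txt: one "name<TAB>size" line per text file,
 # largest files first, so similar-sized files end up next to each other.
 txt_dir = "C:\\Temp\\pdfs"
 entries = [(name, os.path.getsize(os.path.join(txt_dir, name)))
            for name in os.listdir(txt_dir) if name.endswith(".txt")]
 entries.sort(key=lambda e: e[1], reverse=True)
 with open("C:\\Temp\\pdfsOrdered.txt", "w") as out:
   for name, size in entries:
     out.write(name + "\t" + str(size) + "\n")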


The output of the script is:

0.988557658543 C:\Temp\pdfs\3004.txt C:\Temp\pdfs\3002.txt
0.974692907929 C:\Temp\pdfs\3004.txt C:\Temp\pdfs\2826.txt
0.979908614303 C:\Temp\pdfs\3004.txt C:\Temp\pdfs\3005.txt
0.966689002546 C:\Temp\pdfs\3004.txt C:\Temp\pdfs\3003.txt
0.988557658543 C:\Temp\pdfs\3002.txt C:\Temp\pdfs\3004.txt
0.983455575607 C:\Temp\pdfs\3002.txt C:\Temp\pdfs\2826.txt
0.964582084778 C:\Temp\pdfs\3002.txt C:\Temp\pdfs\3005.txt
0.978862084702 C:\Temp\pdfs\3002.txt C:\Temp\pdfs\3003.txt
0.97472262964 C:\Temp\pdfs\2826.txt C:\Temp\pdfs\3004.txt
0.983454833782 C:\Temp\pdfs\2826.txt C:\Temp\pdfs\3002.txt
0.956757108917 C:\Temp\pdfs\2826.txt C:\Temp\pdfs\3003.txt
0.999990842014 C:\Temp\pdfs\1832.txt C:\Temp\pdfs\1833.txt
0.999990842014 C:\Temp\pdfs\1833.txt C:\Temp\pdfs\1832.txt

I will probably use Excel to remove the duplicates by:

  1. Sorting the two similar files of each row and concatenating the sorted pair as a string in a third column. 
  2. Removing the duplicates of the newly created third column. 
  3. Voilà, I get a list of similar, non-duplicate pairs. 
  4. Since there could be several distinct pairs all referencing the same file, I will then build clusters of duplicates (see the sketch after this list). 
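
If Excel turns out to be too painful, the same post-processing can be sketched in a few lines of Python: steps 1 and 2 become a set of sorted pairs, and step 4 a union-find over those pairs. The pairs below are just a hand-picked subset of the output above.

 # Pairs taken from the output above (any (a, b) also appears as (b, a))
 pairs = [("3004.txt", "3002.txt"), ("3002.txt", "3004.txt"),
          ("3004.txt", "2826.txt"), ("1832.txt", "1833.txt")]

 # Steps 1 and 2: sort each pair so (a, b) and (b, a) collapse to one entry
 unique_pairs = set(tuple(sorted(p)) for p in pairs)

 # Step 4: union-find to merge pairs sharing a file into clusters
 parent = {}
 def find(x):
   parent.setdefault(x, x)
   while parent[x] != x:
     parent[x] = parent[parent[x]]  # path halving
     x = parent[x]
   return x

 for a, b in unique_pairs:
   parent[find(a)] = find(b)

 clusters = {}
 for name in parent:
   clusters.setdefault(find(name), set()).add(name)
 for cluster in clusters.values():
   print sorted(cluster)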
