This week I worked on updating my code. I am still waiting on some information, but as soon as I get it I should be able to start incorporating it into my work. I still have the PageRank code I wrote; however, I need to know the format of the information so I can adjust my code to work with it.
I also continued to improve my knowledge of Plotly and of GUIs. I believe I have figured out a way to make the graph. I am looking forward to merging all the code I have written, and I should be able to do that soon.
Next week I will continue working on building the GUI and merging all my code. I should get the files I need from my other teammates so I will be able to make adjustments if need be.
This week I learned as much as I could about Plotly. I also worked on building a GUI. I am still planning on making it in Python; I just need to know what information I am going to be given. Once I am given the information I need, I am going to start designing a GUI and a graph to fit our program. I spent some time last summer learning about GUIs, but I have not made one quite like this.
Next week I will work on making updates to my code. I will also learn more about GUIs and see if I can make it look more like what I need.
This week, I updated my decoder file so that it takes in two articles and compares every sentence in the first article to every sentence in the second article.
import pandas as pd
import numpy as np

# Each CSV has two columns: a value and a sentence.
article1 = pd.read_csv('article1.csv', names=["Value", "Sentence"])
article2 = pd.read_csv('article2.csv', names=["Value", "Sentence"])

# Pair every sentence in article 1 with every sentence in article 2.
# The third field is a placeholder value of 0.
combined_articles = []
for sent1 in article1["Sentence"]:
    for sent2 in article2["Sentence"]:
        combined_articles.append([sent1, sent2, 0])

article1['Value'] = 0
article2['Value'] = 0
article1['Sentence2'] = article2['Sentence']
numRows = len(article1.index)
article1['id'] = np.arange(0, numRows)

# Write one tab-separated row per sentence pair.
with open("test_final.tsv", 'w') as f:
    for row in combined_articles:
        sent1 = row[0]
        sent2 = row[1]
        value = row[2]
        f.write("\"%s\"\t\"%s\"\t%s\n" % (sent1, sent2, value))
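The pairing step can also be written without pandas. The following is only a minimal sketch of the same idea, not the decoder file itself: the function name `write_sentence_pairs` and the sample sentences are made up for illustration, and it assumes the sentences are already available as plain Python lists. It uses `itertools.product` for the all-pairs loop and the `csv` module so that quotation marks inside a sentence are escaped properly.

```python
import csv
import io
from itertools import product


def write_sentence_pairs(sentences1, sentences2, out, label=0):
    """Write every (sentence1, sentence2, label) combination as a TSV row.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    writer = csv.writer(
        out,
        delimiter="\t",
        quoting=csv.QUOTE_NONNUMERIC,  # quote the sentences, not the number
        lineterminator="\n",
    )
    n_rows = 0
    # product() yields every sentence of the first list paired with
    # every sentence of the second list.
    for sent1, sent2 in product(sentences1, sentences2):
        writer.writerow([sent1, sent2, label])
        n_rows += 1
    return n_rows


# Made-up sample input, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
n_pairs = write_sentence_pairs(
    ["How can I be a geologist?", "Rocks are old."],
    ["What should I do to be a geologist?"],
    buffer,
)
print(n_pairs)
print(buffer.getvalue())
```

With m sentences in the first article and n in the second, this writes m * n rows, so the output grows quickly for long articles.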
My task: To carry out a Jaccard index vs. Cosine Similarity test and record my results.
Review: I am testing the accuracy of the Jaccard index method vs. the Cosine similarity method when determining the similarity between two sentences:
- How can I be a geologist?
- What should I do to be a geologist?
From my knowledge of the English language, I know that these two sentences ask very similar, if not exactly the same, things. Thus, a measure of how similar their meanings are should give a score close to, if not exactly, 1.
Based on the two text strings we used as a case study, there are three known facts:
- The value of our target similarity measurement should be 1 or very close to it.
- Jaccard similarity results in a similarity measurement of 0.40.
- Cosine similarity results in a similarity measurement of 0.5774.
Comparing the Jaccard and cosine results from our case study, we can see that cosine similarity gives the higher score (0.5774 vs. 0.40), which is closer to our target measurement of 1. Thus, for this case study, we can conclude that the cosine method works better than the Jaccard method.
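The two scores above can be reproduced with a short script. This is a sketch under one assumption that the write-up does not state: each sentence is lowercased, stripped of punctuation, and treated as a set of distinct words (a binary word vector). The function names are my own. Under that assumption the sentences share 4 words ("i", "be", "a", "geologist") out of 10 distinct words, giving a Jaccard score of 4/10, and the cosine score is 4 / sqrt(6 * 8).

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard_similarity(a, b):
    """Shared words divided by all distinct words in either sentence."""
    ta, tb = tokenize(a), tokenize(b)
    return len(ta & tb) / len(ta | tb)


def cosine_similarity(a, b):
    """Cosine of the angle between the two binary word vectors.

    For binary vectors the dot product is the number of shared words,
    and each vector's length is the square root of its word count.
    """
    ta, tb = tokenize(a), tokenize(b)
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))


s1 = "How can I be a geologist?"
s2 = "What should I do to be a geologist?"

jaccard = jaccard_similarity(s1, s2)
cosine = cosine_similarity(s1, s2)
print(f"Jaccard: {jaccard:.4f}")
print(f"Cosine:  {cosine:.4f}")
```

Both functions assume each sentence contains at least one word; an empty sentence would cause a division by zero.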
This week, I was able to rewrite my code to build the matrix into the format that we need.
import re

f = open(r"test.probs", "r")
count = 0
data = []
while True:
    line = f.readline()
    # When readline returns an empty string, the file is fully read.
    if line == "":
        break
    # When a newline is returned, the line is empty.
    if line == "\n":
        continue
    stripped = line.strip()
    m = re.match(r'.*\:([\d\.]+).*\:([\d\.]+).*', stripped)
    if m is not None:
        difference = 0
        # Check if the probability that they entail each other is
        # greater than the probability that they don't.
        if float(m.group(2)) > float(m.group(1)):
            # If the difference in the probability that they entail
            # each other is greater than the threshold of 0.5, then
            # append to the list called 'data'.
            difference = float(m.group(2)) - float(m.group(1))
            if difference > 0.5:
                data.append((m.group(1), m.group(2), "right"))
f.close()
# The length of data is the number of sentence pairs that entail each other in the article,
# and this will also be the value that we will need to put into the matrix.
count = len(data)
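To check the threshold logic without the real test.probs file, the same filtering can be wrapped in a function and run on a few lines in memory. This is only a sketch: the function name `collect_entailing_pairs` is hypothetical, and the sample lines use a made-up `label:probability` layout, since the actual format of test.probs is not shown here. The regular expression is the one from the script above.

```python
import re

# Same pattern as in the script above: capture the two numbers
# that each follow a colon.
PROB_PATTERN = re.compile(r'.*\:([\d\.]+).*\:([\d\.]+).*')


def collect_entailing_pairs(lines, threshold=0.5):
    """Return the rows whose second probability beats the first by more
    than `threshold`. Hypothetical helper for illustration only.
    """
    data = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        m = PROB_PATTERN.match(stripped)
        if m is None:
            continue  # skip lines that do not contain two probabilities
        p_first = float(m.group(1))
        p_second = float(m.group(2))
        # For a non-negative threshold this also guarantees
        # p_second > p_first, as in the original two-step check.
        if p_second - p_first > threshold:
            data.append((m.group(1), m.group(2), "right"))
    return data


# Made-up sample lines; the real file layout is an assumption.
sample_lines = [
    "0:0.10 1:0.90\n",  # difference 0.80, kept
    "\n",               # blank line, skipped
    "0:0.40 1:0.60\n",  # difference 0.20, below the threshold
    "0:0.80 1:0.20\n",  # second probability is smaller, dropped
]

data = collect_entailing_pairs(sample_lines)
count = len(data)
print(count, data)
```

Passing the lines in as an argument keeps the file handling separate from the filtering, so the threshold can be tested on its own.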