In this assignment you will download some of the mailing list data from http://mbox.dr-chuck.net/, run the data cleaning and modeling process, and take some screen shots. You will then run two visualizations of the email data you have retrieved and processed. The first is a word cloud that visualizes the frequency distribution of words. While a word cloud might seem a little silly and over-used, it is actually a very engaging way to visualize a frequency distribution or histogram, and it is a nice continuation of the frequency/counting assignments we have been doing in this class. The second visualization is a timeline that shows how the data changes over time. You are provided the base code for the two visualizations but will need to edit it to improve the data output. Finally, you will need to create your own visualization using the spidered data.
Here is a copy of the Sakai Developer Mailing list from 2006-2014.
http://mbox.dr-chuck.net/
The base programs gmane.py, gmodel.py, gword.py and gline.py, along with sample generated gword.js (with gword.htm) and gline.js (with gline.htm), are found in gmane.zip.
You can install the SQLite browser (http://sqlitebrowser.org/) if you would like to view and modify the databases used for this assignment.
The gmane.py file is provided for you. It operates as a spider in that it runs slowly and retrieves one mail message per second so as to avoid getting throttled. It stores all of its data in a database and can be interrupted and re-started as often as needed. It may take many hours to pull all the data down. So you may need to restart several times. You should download and process at least 1000 messages for the data visualizations to work – but more data is always better.
The base URL (http://mbox.dr-chuck.net/) is hard-coded in gmane.py. Make sure to delete the content.sqlite file if you switch the base URL.
Navigate to the folder where you extracted gmane.zip.
Here is a run of gmane.py getting the last five messages of the sakai developer list:
How many messages:10
firstname.lastname@example.org 2005-12-09T13:32:29+00:00 re: lms/vle rants/comments
email@example.com 2005-12-09T13:32:31-06:00 re: sakaiportallogin and presense
firstname.lastname@example.org 2005-12-09T13:42:24+00:00 re: lms/vle rants/comments
The program scans content.sqlite from 1 up to the first message number not already spidered and starts spidering at that message. It continues spidering until it has spidered the desired number of messages or it reaches a page that does not appear to be a properly formatted message.
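The resume step described above can be sketched as follows. This is only an illustration, not the code in gmane.py: the table name Messages, its columns, and the helper name first_unspidered_id are assumptions made for the example, and it runs against an in-memory database rather than content.sqlite.

```python
import sqlite3

def first_unspidered_id(conn):
    """Return the lowest message id, counting up from 1, that has no row yet."""
    next_id = 1
    # Walk the stored ids in ascending order and stop at the first gap.
    for (row_id,) in conn.execute("SELECT id FROM Messages ORDER BY id"):
        if row_id != next_id:
            break
        next_id += 1
    return next_id

# Demo on an in-memory database so no file on disk is touched.
# The schema here is an assumption, not the real content.sqlite layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Messages (id INTEGER UNIQUE, email TEXT, subject TEXT)")
conn.executemany("INSERT INTO Messages (id) VALUES (?)", [(1,), (2,), (3,), (5,)])

start = first_unspidered_id(conn)
print("Resume spidering at message", start)
```

With ids 1, 2, 3 and 5 stored, the scan stops at the gap and spidering would resume at message 4.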
Sometimes a message is missing. Perhaps administrators can delete messages or perhaps they get lost; I don’t know. If your spider stops and it seems to have hit a missing message, go into the SQLite browser, add a row with the missing id, leave all the other fields blank, and then restart gmane.py. This will unstick the spidering process and allow it to continue. These empty messages will be ignored in the next phase of the process.
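If you prefer to add the placeholder row from Python instead of a GUI tool, a minimal sketch might look like this. The table name Messages, its columns, and the helper add_placeholder are assumptions for illustration; check the real schema in content.sqlite before running anything like it against your data.

```python
import sqlite3

def add_placeholder(conn, missing_id):
    """Insert a row that carries only the id; every other column stays NULL."""
    conn.execute("INSERT OR IGNORE INTO Messages (id) VALUES (?)", (missing_id,))
    conn.commit()

# In-memory demo with an assumed schema and made-up addresses.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Messages (id INTEGER UNIQUE, email TEXT, sent_at TEXT, subject TEXT)"
)
conn.executemany(
    "INSERT INTO Messages (id, email) VALUES (?, ?)",
    [(1, "a@example.org"), (2, "b@example.org")],
)

add_placeholder(conn, 3)
row = conn.execute("SELECT id, email, subject FROM Messages WHERE id = 3").fetchone()
print(row)
```

The OR IGNORE clause makes the helper safe to call twice for the same id.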
One nice thing is that once you have spidered all of the messages and have them in content.sqlite, you can run gmane.py again to get new messages as they get sent to the list. gmane.py will quickly scan to the end of the already-spidered pages and check if there are new messages and then quickly retrieve those messages and add them to content.sqlite.
The content.sqlite data is pretty raw, with an inefficient data model, and not compressed. This is intentional as it allows you to look at content.sqlite to debug the process. It would be a bad idea to run analysis queries against this database as they would be slow.
The second process is running the program gmodel.py. gmodel.py reads the rough/raw data from content.sqlite and produces a cleaned-up and well-modeled version of the data in the file index.sqlite. The file index.sqlite will be much smaller (often 10X smaller) than content.sqlite because it also compresses the header and body text.
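To get a feel for why compressing the header and body text shrinks the database, here is a small sketch using Python's zlib module. Whether gmodel.py uses zlib specifically is an assumption here, and the message body is made up.

```python
import zlib

# A made-up, repetitive message body; real email text also compresses well
# because of quoted replies and boilerplate signatures.
body = "Thanks for the patch. I will test it against trunk tonight.\n" * 50

raw = body.encode("utf-8")
packed = zlib.compress(raw)                        # bytes suitable for a BLOB column
restored = zlib.decompress(packed).decode("utf-8")

print(len(raw), "bytes raw ->", len(packed), "bytes compressed")
```

The round trip through decompress returns the original text unchanged, so nothing is lost by storing the compressed form.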
Each time gmodel.py runs – it completely wipes out and re-builds index.sqlite, allowing you to adjust its parameters and edit the mapping tables in content.sqlite to tweak the data cleaning process.
Running gmodel.py works as follows:
Loaded allsenders 1588 and mapping 28 dns mapping 1
1 2005-12-08T23:34:30-06:00 email@example.com
251 2005-12-22T10:03:20-08:00 firstname.lastname@example.org
501 2006-01-12T11:17:34-05:00 email@example.com
751 2006-01-24T11:13:28-08:00 firstname.lastname@example.org
The gmodel.py program does a number of data cleaning steps:
Domain names are truncated to two levels for .com, .org, .edu, and .net; other domain names are truncated to three levels. So si.umich.edu becomes umich.edu and caret.cam.ac.uk becomes cam.ac.uk. Also, mail addresses are forced to lower case, and some of the @gmane.org addresses are converted to the real address whenever there is a matching real email address elsewhere in the message corpus.
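The domain truncation rule can be sketched in a few lines. This is an illustration of the rule as stated above, not the code from gmodel.py; the function names clean_domain and organization_for are made up for the example.

```python
# Top-level domains that are truncated to two levels; everything else keeps three.
TWO_LEVEL_TLDS = (".com", ".org", ".edu", ".net")

def clean_domain(domain):
    """Lower-case a domain and keep only its last two or three labels."""
    domain = domain.lower()
    pieces = domain.split(".")
    keep = 2 if domain.endswith(TWO_LEVEL_TLDS) else 3
    return ".".join(pieces[-keep:])

def organization_for(address):
    """Lower-case an address and return the truncated domain after the @."""
    domain = address.strip().lower().rpartition("@")[2]
    return clean_domain(domain)

print(clean_domain("si.umich.edu"))
print(clean_domain("caret.cam.ac.uk"))
```

The two calls reproduce the examples from the text: umich.edu and cam.ac.uk.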
When you are done, you will have a nicely indexed version of the email in index.sqlite. This is the file to use to do data analysis. With this file, data analysis will be really quick.
The first, simplest data analysis is to ask “who does the most?” and “which organization does the most?” This is done using gbasic.py:
How many to dump? 5
Loaded messages= 51330 subjects= 25033 senders= 1584
Top 5 Email list participants
Top 5 Email list organizations
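The counting behind a "top N" report like this can be sketched with collections.Counter. The sender addresses below are made up, and the helper top_n is hypothetical; gbasic.py itself reads its data from index.sqlite.

```python
from collections import Counter

def top_n(items, n):
    """Return the n most frequent items as (item, count) pairs."""
    return Counter(items).most_common(n)

# Made-up senders, one entry per message.
senders = [
    "ann@umich.edu", "bob@cam.ac.uk", "ann@umich.edu",
    "cy@umich.edu", "bob@cam.ac.uk", "ann@umich.edu",
]
# The organization is taken to be the part after the @.
orgs = [s.split("@")[1] for s in senders]

top_people = top_n(senders, 2)
top_orgs = top_n(orgs, 2)
print(top_people)
print(top_orgs)
```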
There is a simple visualization of the word frequency in the subject lines in the file gword.py:
Range of counts: 33229 129
Output written to gword.js
This produces the file gword.js, which has the top 100 words found in the emails. You can view them in a word cloud using the file gword.htm. Once you get gword.py to work, you will need to enhance the program to filter the output as follows:
The output should only contain words that are 4 letters or longer and made up of letters only (no numbers).
The output should remove common words (stop words).
The output should remove sakai and email, which are common in the output but not meaningful for the word cloud.
The output should use content from the subjects of the emails.
The filters should be added in place of the line words = text.split(" ") in the sample program.
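One possible shape for such a filter is sketched below. It is a starting point rather than the expected solution: the stop word set is a tiny illustrative sample that you would replace with a fuller list, and the function name subject_words is made up.

```python
import string

# A deliberately small sample; a real stop word list would be much longer.
STOP_WORDS = {"about", "from", "have", "that", "this", "what", "when", "will", "with", "your"}
# Words that are frequent in this corpus but not meaningful in the cloud.
IGNORED = {"sakai", "email"}

def subject_words(text):
    """Split a subject line and keep only the words that pass every filter."""
    kept = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)    # drop leading/trailing punctuation
        if len(word) < 4 or not word.isalpha():  # too short, or contains non-letters
            continue
        if word in STOP_WORDS or word in IGNORED:
            continue
        kept.append(word)
    return kept

words = subject_words("Re: Sakai 2.5 release notes from the email list")
print(words)
```

For this sample subject the function keeps release, notes and list, and drops everything else.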
A second visualization is in gline.py. It visualizes email participation by organizations over time.
Loaded messages= 51330 subjects= 25033 senders= 1584
Top 10 Oranizations
['gmail.com', 'umich.edu', 'uct.ac.za', 'indiana.edu', 'unicon.net', 'tfd.co.uk', 'berkeley.edu', 'longsight.com', 'stanford.edu', 'ox.ac.uk']
Output written to gline.js
Its output is written to gline.js which is visualized using gline.htm.
Change the gline.py program to show the message count by month instead of by year. You can switch from a by-year to a by-month visualization by changing only a few lines in gline.py. The puzzle is to figure out the smallest change that accomplishes this.
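As a hint at the idea, and assuming the timestamps are stored as ISO 8601 strings like the ones printed by gmodel.py, the only difference between a yearly and a monthly bucket is how many leading characters of the timestamp you keep. The helper counts_by_period and the organization values below are made up for illustration and are not taken from gline.py.

```python
from collections import Counter

def counts_by_period(rows, key_len):
    """Count messages per (period, organization); key_len 4 = year, 7 = year-month."""
    counts = Counter()
    for sent_at, org in rows:
        period = sent_at[:key_len]   # "2005" or "2005-12"
        counts[(period, org)] += 1
    return counts

# Timestamps in the format shown in the sample runs; organizations are invented.
rows = [
    ("2005-12-09T13:32:29+00:00", "umich.edu"),
    ("2005-12-22T10:03:20-08:00", "umich.edu"),
    ("2006-01-12T11:17:34-05:00", "umich.edu"),
]

by_year = counts_by_period(rows, 4)
by_month = counts_by_period(rows, 7)
print(dict(by_year))
print(dict(by_month))
```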
Your Own Visualization:
Once you have gotten the visualization to work for gword and gline you should create one other visualization to display the data in a different way. When creating your own visualization:
It must output data that is different (at least slightly) from that used in gword and gline.
It must use a different chart type than used in gword and gline.
You can create a Bubble chart. This chart can be used as an alternative to the word cloud. Instead of JSON data this uses csv data. A sample bubble chart is shown in sampleBubble.htm using the csv data in flare.csv (in the zip file for the assignment). If you were to choose the bubble chart you would need to:
Create a new Python file (gbubble.py) that is like your final gword.py except that the output is different
Change the output to the csv format seen in flare.csv
Output data for all words with a count of more than 10 (or 50 if you downloaded lots of the data)
Output actual count data instead of the scaled font size for the word cloud.
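The steps above could be sketched like this. The column names id and value are assumptions; open flare.csv from the zip file and match its header and layout exactly. The word counts are made up, and the helper write_bubble_csv is hypothetical.

```python
import csv
import io

def write_bubble_csv(counts, out, min_count=10):
    """Write word,count rows for every word whose count exceeds min_count."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "value"])   # assumed header; copy the one in flare.csv
    # Largest counts first, ties broken alphabetically.
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count > min_count:
            writer.writerow([word, count])   # the raw count, not a scaled font size

# Made-up counts; in gbubble.py these would come from the subject lines.
counts = {"release": 42, "build": 17, "patch": 10, "wiki": 3}

buffer = io.StringIO()
write_bubble_csv(counts, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass a handle from open("gbubble.csv", "w", newline="") instead of the StringIO buffer.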
You could also choose another visualization to use with your data. d3 supports a wide variety of visualizations that you can use with your data: https://github.com/d3/d3/wiki/Gallery
Some other URLs for other visualization ideas:
https://developers.google.com/chart/
https://developers.google.com/chart/interactive/docs/gallery/motionchart
https://code.google.com/apis/ajax/playground/?type=visualization#motion_chart_time_formats
https://developers.google.com/chart/interactive/docs/gallery/annotatedtimeline
http://bost.ocks.org/mike/uberdata/
http://nltk.org/install.html
Submitting Your Work
Please Upload Your Submission:
A screen shot of you running the gmane.py application to produce the content.sqlite database.
A screen shot of you running the gmodel.py application to produce the index.sqlite database.
A screen shot of you running the gbasic.py program to compute basic histogram data on the messages you have retrieved.
A screen shot of word cloud visualization for the messages you have retrieved, before you applied the filters.
A screen shot of word cloud visualization for the messages you have retrieved, after you have applied the appropriate filters.
A screen shot of time line visualization for the messages you have retrieved, by year.
A screen shot of the by-month visualization for the messages.
A screen shot of the new visualization.
A zip file containing all of the py, js, csv, htm and sqlite files you used as a part of the assignment.
A screen shot of you running the gmane.py application to produce the content.sqlite database.
A screen shot of you running the gmodel.py application to produce the index.sqlite database.
A screen shot of you running the gbasic.py program to compute basic histogram data on the messages you have retrieved.
A screen shot of word cloud visualization for the messages you have retrieved, before you applied the filters.
gword.py edited to apply appropriate filters and to get data from the subject of the emails.
A screen shot of word cloud visualization for the messages you have retrieved, after you have applied the appropriate filters.
A screen shot of time line visualization for the messages you have retrieved, by year.
gline.py edited to output data by month and year.
A screen shot of the by-month visualization for the messages.
A new .py file (e.g. gbubble.py) containing code to output the necessary data for another visualization.
A new data file (e.g. gbubble.csv) containing data for the new visualization.
A screen shot of the new visualization.
Total Points: 100.0