Twitter Analysis DB Details
* I have not tuned the db for performance. If you have suggestions ( that are more than just guesses ) let me know. I am sure that more indexing might help; I will try it in time.
= Debugging =
* tweaking the parameter file for your computer
* installing missing modules / packages
The program does not come with an installer; you should drop the code into your Python 3.6 development environment.
See: [[Twitter Analysis DB]] Configure to Run.
= What is in the Database =
First, the database is really a database; it is not a CSV file or a spreadsheet. To make the best use of this tool it really helps to have some understanding of the database, so read this. It is a SQL database in SQLite that could be upgraded to a more robust database if worthwhile.
Currently it consists of three tables:
* tweets -- the actual tweet and supporting columns:
** tweet_id -- identifies the tweet
** tweet_datetime -- when tweeted
** tweet -- text of the tweet
** who -- who tweeted -- so far there is no real support for multiple tweeters, but it would not be too hard to add to the rest of the application
** tweet_type -- a tweet by the author or a retweet
** is_covid -- an indicator that suggests whether the tweet is COVID related or not. Not in the source data; generated by the table load routine. For now, see the code.
* concord, or concordance -- this is every "word" from all the tweets. A concord.word has been sanitized from the raw tweet: it is always lower case, and "dirt" like punctuation has been removed.
** word_type -- just a word, an "@reference", or a hashtag
** is_ascii -- an indicator of whether the entire word is encoded in ASCII; this helps identify tokens that are possibly non-words, like emojis.
* words -- a list ( currently about 300 K ) of words and information on the amount of usage.
** word -- the word ( again all lower case and normalized )
** word_count -- count of the uses in the analyzed body of text.
** word_rank -- index starting at 1 at the highest word_count and ascending as the count decreases.
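The sanitization and ranking rules above can be sketched in Python. This is a hedged illustration only: the helper names ( sanitize_word, classify_word, looks_covid, rank_words ) and the COVID keyword list are assumptions for the sketch, not the project's actual code -- read load_tweets_data.py for the real routines.

```python
# Illustrative sketch of the kind of cleanup the load routine performs.
# All names and the keyword list here are assumptions, not the real code.
import string
from collections import Counter

COVID_TERMS = {"covid", "coronavirus", "pandemic"}  # assumed keyword list

# Strip whitespace and punctuation "dirt", but keep a leading @ or #
# so the word_type is still recoverable after sanitizing.
STRIP_CHARS = string.whitespace + "".join(
    ch for ch in string.punctuation if ch not in "@#")

def sanitize_word(raw):
    """Lower-case a raw token and strip surrounding punctuation."""
    return raw.lower().strip(STRIP_CHARS)

def classify_word(word):
    """word_type: an "@reference", a hashtag, or just a word."""
    if word.startswith("@"):
        return "@reference"
    if word.startswith("#"):
        return "hashtag"
    return "word"

def is_ascii(word):
    """is_ascii flag; non-ASCII tokens are often emojis rather than words."""
    return all(ord(ch) < 128 for ch in word)

def looks_covid(tweet_text):
    """A simple keyword heuristic of the sort the loader might use for is_covid."""
    tokens = {sanitize_word(tok) for tok in tweet_text.split()}
    return bool(tokens & COVID_TERMS)

def rank_words(tweets):
    """Build (word, word_count, word_rank) rows; rank 1 = highest count."""
    counts = Counter()
    for text in tweets:
        for tok in text.split():
            word = sanitize_word(tok)
            if word:
                counts[word] += 1
    return [(word, count, rank)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]
```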
== Advanced SQL ==
Advanced SQL has been avoided to keep the application fairly independent of the database used. Some extra "magic" has been used in the Python to provide similar features. Avoiding too many joins also helps keep the db fast.
== Output Decoding ==

== Pseudo Columns ==

Tables can of course be joined: tweets to concord on tweet_id, and concord to words on word. The latter join may of course fail ( produce a null in one table or the other ), so outer joins should be used.
Standard SQL tools outside of the app can be used to view the data: SQLiteStudio, for example.
= Building a Database =
I am working on providing DB building facilities from the GUI. Since this is sensitive to the input sources, it only works with the type of input sources I have used. Not everything is in the GUI as of this writing; this will probably change.
== Parameters ==
== Run the GUI ==
* <Show Load Parameters> will show you the values of some of the parameters used to load the db. If you do not like what you get, edit the parameter file. ( And there is a button for that too. )
== Some Code and Theory ==
load_tweets_data.py contains the code for defining and loading the db tables. The define_table_tweets() and define_table_concord() methods do the definitional work.... read the code.
A class TweetFileProcessor performs the table loads. Again, read the code. The general idea is:
I have tried to make the above accurate, but read the code.
TweetTableWriter and ConcordTableWriter are helper classes that do the actual table writing.
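To make the division of labor concrete, here is a guessed, minimal skeleton of how a processor might hand rows to two such writers. The class names TweetFileProcessor, TweetTableWriter, and ConcordTableWriter come from this article, but every signature and method body below is an illustrative assumption -- the real logic lives in load_tweets_data.py.

```python
# Guessed skeleton only; signatures and bodies are assumptions, not the
# project's actual code. Shows the writer/processor division of labor.
import sqlite3

class TweetTableWriter:
    """Writes one row per tweet into the tweets table."""
    def __init__(self, conn):
        self.conn = conn

    def write(self, tweet_id, tweet, who):
        self.conn.execute(
            "INSERT INTO tweets (tweet_id, tweet, who) VALUES (?, ?, ?)",
            (tweet_id, tweet, who))

class ConcordTableWriter:
    """Writes one row per word occurrence into the concord table."""
    def __init__(self, conn):
        self.conn = conn

    def write(self, tweet_id, word):
        self.conn.execute(
            "INSERT INTO concord (tweet_id, word) VALUES (?, ?)",
            (tweet_id, word))

class TweetFileProcessor:
    """Reads tweets and hands each row to the two writers."""
    def __init__(self, conn):
        self.tweet_writer = TweetTableWriter(conn)
        self.concord_writer = ConcordTableWriter(conn)

    def process(self, tweet_id, tweet, who):
        self.tweet_writer.write(tweet_id, tweet, who)
        # One concord row per (lightly normalized) token in the tweet.
        for token in tweet.lower().split():
            self.concord_writer.write(tweet_id, token)
```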