TF-IDF-News-Summarization

Methodology for Summarizing News Articles

This code aims to summarize news articles using a TF-IDF based approach. The following methodology has been used:

Loading the Dataset

The news articles dataset is loaded using pandas read_csv function.

Splitting into Train and Test Sets

The dataset is split into a train set and a test set. The train set consists of 90% of the data, while the test set consists of the remaining 10%.
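A minimal sketch of the loading and splitting steps, assuming a random 90/10 split (the README does not say whether the split is random or sequential). The in-memory CSV, the column names `id` and `content`, and the `random_state` value are placeholders for illustration only.

```python
import io

import pandas as pd

# Stand-in for the real news CSV; the column names are hypothetical.
rows = ["id,content"] + [f"{i},Article text number {i}" for i in range(20)]
df = pd.read_csv(io.StringIO("\n".join(rows)))

# Hold out 10% of the rows as the test set (a random split is an assumption).
test_df = df.sample(frac=0.1, random_state=42)
# Every row not sampled into the test set goes to the train set.
train_df = df.drop(test_df.index)

print(len(train_df), len(test_df))
```

With a real file, the `io.StringIO(...)` argument would be replaced by the path to the dataset.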

Preprocessing

A function preprocess is defined to preprocess the text data. The function performs the following operations:

Converts the text to lowercase

Tokenizes the text into words

Removes stopwords and punctuation

Joins the words to form sentences

TF-IDF Vectorization

The test set is preprocessed using the preprocess function, and the TF-IDF score is then computed for each sentence in the test set using the TfidfVectorizer class from scikit-learn. The parameters used are:

  • ngram_range=(1, 2): considers both unigrams and bigrams
  • max_df=0.8: ignores terms that occur in more than 80% of the documents
  • min_df=5: ignores terms that occur in fewer than 5 documents
  • use_idf=True: uses inverse document frequency weighting

Sentence Scoring

The sentences in the test set are ranked by importance using the TF-IDF scores computed in the previous step. The top 30% of the sentences are selected as the summary for each article.
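The preprocessing and ranking steps can be sketched together as follows. This is an illustration under several assumptions, not the project's actual code: the stopword list is a tiny hand-written stand-in, sentences are split with a naive regular expression, a sentence's score is taken to be the sum of its TF-IDF weights, and `min_df`/`max_df` are left at their defaults so that a single short article still yields a vocabulary. The names `STOPWORDS` and `summarize` are hypothetical.

```python
import math
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative stopword list (assumption; a real run would use a fuller one).
STOPWORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "on",
             "for", "from", "with", "it", "this", "that"}


def preprocess(text):
    """Lowercase, tokenize, drop stopwords and punctuation, and re-join."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


def summarize(article, ratio=0.3):
    """Return the top `ratio` share of sentences, kept in original order."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", article) if s.strip()]
    cleaned = [preprocess(s) for s in sentences]

    # Each sentence is treated as one "document" for TF-IDF purposes.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), use_idf=True)
    matrix = vectorizer.fit_transform(cleaned)

    # Score = sum of the TF-IDF weights in the sentence's row (assumption).
    scores = np.asarray(matrix.sum(axis=1)).ravel()

    keep = max(1, math.ceil(ratio * len(sentences)))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:keep])
    return [sentences[i] for i in chosen]


article = (
    "The market rallied on strong earnings. "
    "Investors cheered the strong earnings report from tech firms. "
    "It rained. "
    "Analysts expect the rally to continue next quarter."
)
summary = summarize(article)
print(summary)
```

For this four-sentence article, 30% rounds up to two sentences. Rounding up rather than down is itself a design choice made here so that very short articles still produce a summary.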

Cleaning Responses

The selected sentences are removed from the original article to generate the cleaned response.
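Read literally, this step filters the selected sentences out of the article. A minimal sketch, assuming the article has already been split into a list of sentences and that sentences are matched by exact string equality (`remove_selected` is a hypothetical name):

```python
def remove_selected(sentences, selected):
    """Return the article text with the selected sentences taken out."""
    selected_set = set(selected)
    remaining = [s for s in sentences if s not in selected_set]
    return " ".join(remaining)


sentences = ["Stocks rose today.", "It rained.", "Analysts were upbeat."]
cleaned = remove_selected(sentences, ["It rained."])
print(cleaned)
```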

ROUGE Score Calculation

The ROUGE score is calculated using the Rouge package, which reports recall, precision, and F1 for ROUGE-1, ROUGE-2, and ROUGE-L, comparing the summary against the original article.
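To make the `r`, `p`, and `f` fields concrete, here is a simplified ROUGE-1 computed by hand on whitespace tokens with clipped unigram counts. It is an illustration of the metric only, not the Rouge package's implementation, so its numbers may differ from what the package reports; `rouge_1` is a hypothetical helper name.

```python
from collections import Counter


def rouge_1(hypothesis, reference):
    """Simplified ROUGE-1: unigram overlap between summary and reference."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    # Clipped overlap: each unigram counts at most as often as it appears in both.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    precision = overlap / len(hyp) if hyp else 0.0
    recall = overlap / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"r": recall, "p": precision, "f": f1}


scores = rouge_1("the cat sat", "the cat sat on the mat")
print(scores)
```

Because every word of the summary appears in the reference, precision is 1.0 while recall is only 0.5, the same pattern seen when a summary is a subset of the original article.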

Output

The cleaned responses for the test set along with ROUGE scores are printed.

Parameters Used

The following parameters were used for the TF-IDF vectorization:

  • ngram_range=(1, 2): The range of n-grams considered for the TF-IDF score. The range is set to consider both unigrams and bigrams.
  • max_df=0.8: The maximum document frequency allowed for a word to be considered in the TF-IDF score. Words that occur in more than 80% of the documents are ignored.
  • min_df=5: The minimum document frequency allowed for a word to be considered in the TF-IDF score. Words that occur in fewer than 5 documents are ignored.
  • use_idf=True: Whether or not to use inverse document frequency weighting in the TF-IDF score. Inverse document frequency is used to give more weight to words that are rare in the corpus.

These parameters were chosen based on empirical experimentation and are known to be effective in generating informative summaries.
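The effect of these settings can be seen on a toy corpus. The sketch below passes the four parameters listed above to scikit-learn's `TfidfVectorizer`; the ten-document corpus is invented for illustration and is chosen so that both frequency cut-offs trigger.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "news" appears in all 10 documents,
# the stock-related terms in 6, and the weather-related terms in 4.
corpus = ["news stock market update"] * 6 + ["news weather report"] * 4

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),  # unigrams and bigrams
    max_df=0.8,          # drop terms found in more than 80% of documents
    min_df=5,            # drop terms found in fewer than 5 documents
    use_idf=True,        # weight terms by inverse document frequency
)
matrix = vectorizer.fit_transform(corpus)

# Terms that survived both document-frequency filters.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(matrix.shape)
```

Here "news" is dropped by `max_df` and the weather terms by `min_df`, while the bigram "news stock" survives because it occurs in exactly 6 documents. Note that `min_df=5` requires a corpus of reasonable size: on very small inputs every term can be filtered out, and scikit-learn raises an error.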

Original Content

After a lot of teasing, Benny Blanco’s collaborative song with BTS and Snoop Dogg is finally here! ‘Bad Decisions’ features BTS’ vocal line members Jin, Jimin, V and Jungkook along with rapper Snoop Dogg, and is a pre-release single from Benny Blanco’s upcoming full-length album Released on August 5 at 9:30 am IST, the light beat of ‘Bad Decisions’ becomes the perfect background for its lyrics that express honest feelings to a loved person The dance track carries a cool feeling, which helps the listener feel refreshed on a hot summer day Meanwhile, the hilarious music video sees Benny Blanco getting ready to attend a BTS concert, only to find out that he has got the date wrong, and to be told that he's made a "bad decision"!Check out the fun music video for ‘Bad Decisions’, below: https://www youtube com/embed/BGNkkVrJZksPrior to the release, two music video teasers were dropped, adopting a style reminiscent of that of a movie trailer With a classic narration accompanying BTS’ shots from their previous music video, the teasers hilariously teased the then-upcoming song, without actually giving away any hint about the song itself or its music video Further, Benny Blanco also kicked off a hunt for ‘Bad Decisions’, by sharing a video which showed him hiding a USB drive containing the song “somewhere in LA” https://twitter com/ItsBennyBlanco/status/1555041143183708160?ref_src=twsrc%5EtfwThe next few hours saw ARMYs coming together to search for the USB drive, with #FindBadDecisions trending on Twitter With an hour and a half to go before the song’s release, Benny Blanco took to Twitter again, this time dropping an address for the USB drive’s location, and asking fans to meet up there!From the teasers to a hunt for a USB drive and more, ‘Bad Decisions’ has been a series of chaotically fun events!

Following today’s music video premiere and release of ‘Bad Decisions’ on streaming platforms, the lyric video of the song will be out on BTS’ YouTube channel tomorrow, at 9:30 am IST on August 6 This will be followed by a ‘Visualizer’ released through Benny Blanco’s YouTube channel on August 8, and the ‘BTS Recording Sketch’ video through BTS’ YouTube channel on August 16 What did you think about ‘Bad Decisions’? Share your thoughts with us through the comments! Join the biggest community of K-Pop fans live on Pinkvilla Rooms to get one step closer to your favourite K-Celebs! Click here to join ALSO READ: TWICE drops exciting timetable for 11th mini album ‘BETWEEN 1&2’; Teases opening trailer, album preview & more

New Content

lot teasing benny blanco collaborative bts snoop dogg finally bts members jin jimin v jungkook along rapper snoop dogg single benny blanco upcoming light beat lyrics express honest feelings loved person dance track carries cool feeling helps listener feel refreshed hot day meanwhile hilarious sees benny blanco getting ready attend bts concert got date wrong told made decision check two teasers dropped adopting style reminiscent trailer classic narration accompanying bts shots previous teasers hilariously teased without away hint benny blanco kicked hunt sharing showed hiding usb drive containing somewhere next hours armys coming together usb drive findbaddecisions trending half go benny blanco time dropping address usb drive location fans meet teasers hunt usb drive series chaotically events following today premiere streaming platforms lyric bts tomorrow followed visualizer benny blanco bts recording sketch bts think share thoughts us comments biggest community fans live pinkvilla rooms get one step closer favourite click read twice drops exciting timetable teases trailer preview

Removed Lines

bad decisions video music video august song music youtube channel channel ist fun youtube release https album join becomes perfect join also actually la search mini album mini asking features background vocal movie hour released summer giving find saw took twitter opening line

ROUGE Score

{'rouge-1': {'r': 0.7949895182422052, 'p': 1.0, 'f': 0.8853897425565066}, 'rouge-2': {'r': 0.4294801120909789, 'p': 0.6577161015023808, 'f': 0.5182002646562611}, 'rouge-l': {'r': 0.7949895182422052, 'p': 1.0, 'f': 0.8853897425565066}}

Note

Only the last paragraph of the output is shown.