## CS 4641 Project: Group 24 (Kevin Chen, Linsey Chen, Stella Chen, Savannah Chunn, Sanders Lee)

## Anime Recommendation Engine

## Motivation

Anime is a form of animated media with origins tied to Japan. A recent Google Trends report revealed that there are between 10M and 100M searches for anime-related topics every month. This number has peaked in recent months as a result of nation-wide quarantine orders and subsequent efforts to find an entertainment medium. Our goal is to apply machine learning to recommend the best anime for a user to watch based on their personal favorites. Recommendation engines can be built using the techniques of either collaborative or content-based filtering. Due to the limitations of our dataset, our implementation uses content-based filtering with a modified KNN. To enhance the model and provide only the best recommendations, we used a combination of dense, categorical, and textual features.

## Data

### Dataset Description

We utilized a dataset that we found through a project called tidy.csv, which had been constructed by cleaning a Kaggle dataset. Our complete original dataset has 77,911 records, each consisting of 28 features. These features include:

Figure 5: Anime Count by Decade of Premiere

### Pre-processing

Before we were able to use the data, we first had to clean it by removing the unnecessary columns and replacing NA values with 0s. Although our dataset had 77,911 rows, many of these rows were duplicates for a single anime title. For example, the anime Cowboy Bebop was duplicated 17 times, once for each genre, each studio, and/or each producer that worked on the anime. To clean this up, we grouped all the anime together by title and consolidated the information to remove the duplicated rows, ultimately condensing our dataset from 77,911 rows to 2,856 unique anime. Following this, we also one-hot encoded all of the categorical data columns (i.e. genre, studio, source, producers, rating, type). One-hot encoding not only reduced the number of rows in our dataset by ensuring that each anime only occupied one row, but also prepared the dataset for constructing the vectors during the data modelling phase.
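As a sketch, the consolidation and one-hot encoding step might look like the following in pandas. The toy rows here are hypothetical stand-ins for the real tidy.csv columns, which are not reproduced in this report:

```python
import pandas as pd

# Hypothetical toy rows mimicking the duplicated structure described above:
# one row per (title, genre) pair, so a single anime appears multiple times.
rows = pd.DataFrame({
    "name":  ["Cowboy Bebop", "Cowboy Bebop", "Cowboy Bebop", "Trigun"],
    "genre": ["Action", "Sci-Fi", "Space", "Action"],
    "score": [8.8, 8.8, 8.8, 8.2],
})

# One-hot encode the categorical column, then collapse duplicates so each
# anime occupies exactly one row (max() keeps a 1 wherever any row had it).
onehot = pd.get_dummies(rows, columns=["genre"], prefix="genre")
condensed = onehot.groupby("name", as_index=False).max()

print(len(condensed))  # one row per unique anime
```

The same `get_dummies` + `groupby(...).max()` pattern extends to the studio, source, producer, rating, and type columns.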


In addition to the categorical data columns, our dataset conveniently held a wealth of information in the form of a textual synopsis for each anime. To utilize this, we used a pretrained word2vec model by Google that was trained on the Google News corpus (about 100 billion words) to output 300-dimensional word vectors. The idea was to use the word embeddings to capture the semantics of each summary, in an attempt to find other anime with semantically similar summaries. In order to ensure that the input to the model was standardized, the synopsis for each anime was pre-processed to ensure that it was properly formatted and consisted only of words of interest. We removed all punctuation and capitalization, as well as common words such as “a”, “an”, and “in”, using the list of default stopwords used by MySQL’s MyISAM search indexes. This significantly reduced the number of words we were working with, as the size of our word bank decreased from 34,354 to 21,259 and the maximum length of the synopses decreased from 540 to 290 words. We then computed a 1x300 **synopsis summary vector** for each anime by plugging every word of the synopsis into the word2vec model and averaging all of the vectors. Note, fictional words specific to an anime (such as “Geass” or names like “Lelouch”) may not generate a word embedding, in which case the word is simply ignored in the final calculation of the synopsis summary vector.
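A minimal sketch of this averaging step, using a tiny stand-in embedding table and stopword list in place of the 300-dimensional GoogleNews word2vec model and the full MySQL stopword list:

```python
import re
import numpy as np

STOPWORDS = {"a", "an", "in", "the", "of", "by"}  # stand-in for MySQL's full list

# Stand-in 4-dimensional "embeddings"; the report uses Google's 300-d
# word2vec model, which is far too large to inline here.
embeddings = {
    "pirate": np.array([1.0, 0.0, 0.0, 0.0]),
    "space":  np.array([0.0, 1.0, 0.0, 0.0]),
    "crew":   np.array([0.0, 0.0, 1.0, 0.0]),
}

def synopsis_vector(text, emb, dim=4):
    """Average the embeddings of known words; unknown words (e.g. 'Geass',
    'Lelouch') are simply skipped, as described above."""
    words = re.sub(r"[^a-z\s]", "", text.lower()).split()
    vecs = [emb[w] for w in words if w not in STOPWORDS and w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

v = synopsis_vector("A pirate crew in space, led by Lelouch.", embeddings)
print(v)  # the average of the pirate/crew/space vectors
```

With the real model, the embedding lookup would go through gensim's `KeyedVectors` loaded from the GoogleNews binary, but the averaging logic is the same.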

Figure 6: Synopsis summary vector

Ultimately, each anime had a corresponding feature vector of shape 1x414. To better understand our feature set and the intrinsic relationships amongst features, the following correlation matrices (computed on subsets of features for visibility) were generated:

Figure 7: Correlation matrix for stats và genre features

The above *stats* correlation matrix shows many expected behaviors, for example a very strong negative correlation between score and ranking, and a very strong positive correlation between members and number of favorites. Likewise, there are relatively strong positive correlations between the genres “Ecchi” and “Harem”, and “Fantasy” and “Magic”. Particularly interesting was the fact that anime with the genre “Kids” had a much higher chance of being popular, while anime labelled “Romance” were more likely to be less popular.

Figure 8: Correlation matrix for stats & producer features

The above figure shows the correlation matrix for the subset of our features containing information on the producer. While there were many producers to consider, the more notable ones, Aniplex, a flagship animation company owned by Sony, and Dentsu, Japan’s largest advertising company, had positive correlations with respect to score, number of favorites, and number of members.
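Matrices like these come directly from pandas’ built-in `corr()`. A toy sketch with hypothetical stats values (not our real data) showing the same qualitative relationships:

```python
import pandas as pd

# Hypothetical slice of the stats features; the report computes the same
# kind of matrix over the genre and producer feature subsets as well.
stats = pd.DataFrame({
    "score":     [8.8, 7.1, 6.0, 9.0],
    "rank":      [10, 900, 2400, 5],
    "members":   [1_500_000, 90_000, 12_000, 1_700_000],
    "favorites": [70_000, 2_000, 150, 95_000],
})

corr = stats.corr()
print(corr.loc["score", "rank"])        # strongly negative, as observed above
print(corr.loc["members", "favorites"]) # strongly positive
```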

### PCA

Because our feature space was so large (primarily as a result of using textual features), we attempted to reduce it using PCA. By graphing the cumulative captured variance of the components, we deduced that using 300 components out of the total 412 was suitable for our needs, as it covered 98% of the variance of our feature set. This PCA’ed version of our feature set was then used in our KNN model to find the best anime recommendations.
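A sketch of this component-selection step with scikit-learn, using random stand-in data in place of our 2,856 x 412 feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # stand-in for the real feature matrix

# Fit a full PCA once to read off the cumulative explained variance curve,
# then pick the smallest k that captures at least 98% of the variance.
pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumvar, 0.98) + 1)

X_reduced = PCA(n_components=k).fit_transform(X)
print(k, X_reduced.shape)
```

On our real features, the same procedure is what yields the 300-of-412 figure above; the 2-D visualization in Figure 10 is the `n_components=2` special case.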

Figure 9: Captured variance of 300 components was 98%

In an attempt to better visualize the feature space and the relative spacing and groupings of anime, we used PCA to convert down to 2D space. It is important to note that using 2 components captures only 12.2% of the total variance in our feature set; thus the feature space visualization is not optimal, but merely serves to give a better understanding of the dataset.

Figure 10: PCA of feature space into 2D space

### DBSCAN

The PCA graph in 2-dimensional space showed clearly distinct clusters of anime, which made us wonder exactly how these clusters formed and what type of anime was represented in each cluster. To tackle this problem, we converted our feature space to 300 dimensions (the same feature space as our input to KNN) and performed DBSCAN, an unsupervised clustering algorithm. In order to properly use DBSCAN, we tuned the *minpts* parameter by hand such that not all the points were located in one cluster, nor were there an exceptionally large number of noise points. Note, we could not use the common heuristic for choosing *minpts*.

(Table columns: Cluster 1, Cluster 2, Cluster 3, Cluster 4, Outlier Cluster)

Figure 13: Top-15 anime represented in each cluster
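A sketch of the clustering step with scikit-learn’s DBSCAN, run on toy 2-D blobs instead of the real 300-dimensional feature space. The `eps` and `min_samples` (i.e. *minpts*) values here are illustrative, not the ones we tuned:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Toy 2-D data standing in for the 300-d PCA'ed feature space.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# eps and min_samples tuned by hand, as in the report, so that points neither
# collapse into one cluster nor mostly become noise (DBSCAN labels noise -1).
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```

The hand-tuning loop amounts to sweeping `eps`/`min_samples` and rejecting settings where `n_clusters == 1` or `n_noise` is a large fraction of the data.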

## Modelling & Results

### Modelling

The KNN algorithm seeks to find the k most similar anime to a given anime. However, it is often very difficult for users to capture the full breadth of their anime preferences in a single title. In our modified KNN algorithm, we allow users to input an arbitrary number of anime that they like, in an attempt to better understand their preferences and recommend anime catered to them. Assume a user inputs *n* different anime that they enjoyed. To model this, we average the *n* feature vectors of those anime and compute KNN on this new vector, which ideally captures the essence of each of their preferred anime.

Figure 14: KNN input vector
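A sketch of the modified KNN described above: average the input feature vectors, then rank every other anime by cosine similarity to that average. Random stand-in data takes the place of the real 2,856 x 414 matrix:

```python
import numpy as np

def recommend(feature_matrix, input_idxs, k=5):
    """Modified KNN: average the feature vectors of the user's n input anime,
    then rank all other anime by cosine similarity to that average."""
    query = feature_matrix[input_idxs].mean(axis=0)
    # cosine similarity = dot(a, b) / (||a|| * ||b||)
    norms = np.linalg.norm(feature_matrix, axis=1) * np.linalg.norm(query)
    sims = feature_matrix @ query / np.where(norms == 0, 1, norms)
    sims[input_idxs] = -np.inf  # never recommend the inputs themselves
    return np.argsort(sims)[::-1][:k]

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 8))  # stand-in for the anime feature matrix
recs = recommend(X, [0, 3], k=3)
print(recs)
```

Swapping the similarity line for a negative Euclidean distance gives the alternative metric discussed below.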

Our primary distance metric was cosine similarity, which measures the angle between our input average feature vector and each of the feature vectors for anime in the dataset. We preferred cosine similarity as a distance measurement due to the way our dataset values were distributed. To process our data, we one-hot encoded our categorical data values, like genre, studio, and source; these columns were represented in our processed data as 1s and 0s. In comparison, our originally quantitative feature values, such as episodes, which ranged from 1 to 1787, and scored_by, with a minimum of 8 and a maximum of 1107995, were much greater than our one-hot encoded values, and could skew our KNN results towards the originally quantitative features. With this in mind, we implemented cosine similarity as a distance measurement because it focuses on the angle between the vectors and does not consider their respective weights or magnitudes.

Figure 16: Anime dataset example data; genre_Action (far right) is an example of one-hot encoding of the categorical feature genre

Our alternative distance metric was Euclidean distance, measured by d(x, y) = √(Σᵢ (xᵢ - yᵢ)²).

Euclidean distance, in contrast to cosine distance, is akin to measuring the actual distance between two vectors, and is thus affected by both the angle and the magnitudes of the vectors. We implemented Euclidean distance as an alternative measurement because we were interested in seeing how the two distance functions would perform relative to each other.
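The contrast between the two metrics can be seen on a toy pair of vectors that point the same way but differ hugely in magnitude, mimicking a one-hot feature sitting next to a raw count like scored_by:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cos(theta): ignores magnitude, compares direction only."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    """sqrt(sum((a_i - b_i)^2)): sensitive to both angle and magnitude."""
    return float(np.linalg.norm(a - b))

# Same direction, magnitudes 100x apart.
a = np.array([1.0, 2.0])
b = np.array([100.0, 200.0])

print(cosine_distance(a, b))    # ~0: the angle is identical
print(euclidean_distance(a, b)) # large: dominated by the magnitude gap
```

This is exactly why, without normalization, Euclidean-based KNN is pulled toward the large quantitative features in our results below.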

For our KNN implementation, we compare the distance value of each feature vector to our input average vector. When considering Euclidean distance, these values can be compared directly (e.g. d(x1, average) = 7.8).

**EXAMPLE 1: From a single anime**

We first chose a single anime to test our KNN model with. This represents an input set with fully minimized variability. Recommendations are as follows:

| | Cosine Unaltered | Cosine Normalized | Euclidean Unaltered | Euclidean Normalized |
| --- | --- | --- | --- | --- |
| STD Input Distance | 1.11e-16 | 2.22e-16 | 0 | 0 |
| Distances | Sword Art Online: 4.53e-05<br>Dragon Ball Z: 4.82e-05<br>Code Geass: Lelouch R2: 5.28e-05<br>Death Note: 5.83e-05<br>One Punch Man: 1.59e-04 | Attack on Titan S2: 0.26<br>Fullmetal Alchemist: Brotherhood: 0.36<br>Death Note: 0.38<br>Code Geass: Lelouch: 0.40<br>Code Geass: Lelouch R2: 0.44 | Sword Art Online: 68802.63<br>Death Note: 132434.60<br>Fullmetal Alchemist: Brotherhood: 261364.26<br>One Punch Man: 384929.08<br>Tokyo Ghoul: 459418.36 | Attack on Titan S2: 17.51<br>Code Geass: Lelouch: 21.16<br>Code Geass: Lelouch R2: 21.60<br>Fullmetal Alchemist: Brotherhood: 22.11<br>Akame ga Kill: 22.31 |
| AVG Distances | 7.29e-05 | 0.37 | 261389.78 | 20.94 |

**Quantitative Feature Comparisons from EXAMPLE 1**

**popularity** (Mean: 2988.34, St. Dev: 2868.05)

| Distance | Normalized | Avg Abs St. Z | Avg St. Feature Dev |
| --- | --- | --- | --- |
| Cosine | no | 0.01 | 32.42 |
| Cosine | yes | 0.01 | 20.45 |

From the above table for the popularity feature, we can see that the popularity values of the Cosine un-normalized KNN results are on average further from the input average of the popularity feature than those of the Cosine normalized KNN. This is directly opposite to our feature analysis of the scored_by results. However, it should be noted that the popularity of an anime is inversely proportional to its value for the popularity feature; for example, an anime with popularity feature value 4 is more popular than an anime with popularity feature value 200. It is likely that Cosine normalized KNN performed better than Cosine un-normalized KNN for the popularity feature because our input anime had a popularity of 2, which is a small value and is likely less skewed when normalized.
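The report does not spell out the formula behind “average absolute standard Z”; a plausible reconstruction, consistent with how the tables read, is the mean of |value − input value| / feature St. Dev over the recommendations:

```python
import numpy as np

def avg_abs_standard_z(recommended_vals, input_mean, feature_std):
    """Hypothetical reconstruction of 'average absolute standard Z':
    the mean of |x - input_mean| / feature_std over the recommendations.
    Smaller values mean the recommendations sit closer to the input."""
    z = (np.asarray(recommended_vals, dtype=float) - input_mean) / feature_std
    return float(np.mean(np.abs(z)))

# e.g. hypothetical popularity values of 5 recommendations, against an input
# popularity of 2, scaled by the feature's St. Dev (2868.05 per the report).
score = avg_abs_standard_z([4, 30, 12, 88, 7], input_mean=2, feature_std=2868.05)
print(score)
```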

**members** (Mean: 100507.59, St. Dev: 164257.15)

| Distance | Normalized | Avg Abs St. Z | Avg St. Feature Dev |
| --- | --- | --- | --- |
| Cosine | no | 2.45 | 516539.41 |
| Cosine | yes | 2.47 | 474466.31 |

From the above table for the members feature, we can see that the members values of the Cosine normalized KNN results are on average further from the input average than those of the Cosine un-normalized KNN. This is likely because our input members value was 1500958, a high value that may have been skewed by normalization, as the members feature also has a high range (52 to 1610561).

**favorites** (Mean: 1610.34, St. Dev: 6211.04)

| Distance | Normalized | Avg Abs St. Z | Avg St. Feature Dev |
| --- | --- | --- | --- |
| Cosine | no | 4.75 | 31280.51 |
| Cosine | yes | 5.22 | 38706.69 |

The results for the favorites feature were similar to those for the members feature. Like members, the favorites feature has a large range (0 to 120331), and our input anime had a high favorites value of 70555 (3rd quartile).

If we compare average absolute standard Z between our quantitative features, favorites had the largest average absolute standard Z. We can expect this because the favorites feature has a large range of values (from 0 to 120331) and a moderately high standard deviation (6211.04) for its range. Of the features, popularity had the lowest average absolute standard Z. Although the range of the popularity feature is relatively large (from 1 to 15013), the data distribution for popularity is right-skewed:

and the bulk of the data for popularity is small in value. Because of this distribution, we were able to get results with small variance based on our input popularity of 2. In contrast, if we run KNN on an input with a larger popularity value, we get significantly different results (see below).

**Partial feature test, Median popularity input** (Mean: 2988.34, St. Dev: 2868.05)

| Distance | Normalized | Avg Abs St. Z | Avg St. Feature Dev |
| --- | --- | --- | --- |
| Cosine | no | 0.05 | 180.49 |

The resulting average absolute standard Z for the popularity feature in this test is 0.05, which is much greater than our EXAMPLE 1 test result (average absolute standard Z: 0.01).

**EXAMPLE 2: From a single series of anime**

For this KNN test, we selected a series of anime to represent a very closely associated input set. We ran our KNN model and received the following results:

| | Cosine Unaltered | Cosine Normalized | Euclidean Unaltered | Euclidean Normalized |
| --- | --- | --- | --- | --- |
| STD Input Distance | 1.11e-16 | 2.22e-16 | 0 | 0 |
| Distances | anohana: 9.24e-06<br>Madoka Magica the Movie: 1.40e-05<br>Kuroko’s Basketball: 1.42e-05<br>Vampire Knight: 2.51e-05<br>Maid Sama!: 2.68e-05 | Gun Samurai Recap: 0.12<br>March Comes in Like a Lion: 0.18<br>Berserk: Recollections: 0.24<br>So, I Can’t Play H!: 0.26<br>Tsukigakirei: First Half: 0.31 | Miss Kobayashi’s Dragon Maid: 10003.85<br>Rosario + Vampire: 10933.50<br>My Teen Romantic Comedy: 13918.15<br>GATE: 16494.10<br>JoJo’s Bizarre Adventure: 18196.80 | March Comes in Like a Lion: 21.31<br>Persona 4 the Animation: 27.07<br>Fullmetal Alchemist: Premium: 29.63<br>Shiki Specials: 29.80<br>Robot Girls Z: 30.68 |
| AVG Distances | 1.79e-05 | 0.23 | 13909.29 | 27.70 |

**Quantitative Feature Comparisons from EXAMPLE 2**

**members** (Mean: 100507.59, St. Dev: 164257.15)

| Distance | Normalized | Avg Abs St. Z | Avg St. Feature Dev |
| --- | --- | --- | --- |
| Cosine | no | 1.03 | 192446.07 |
| Cosine | yes | 2.21 | 376356.68 |
| Euclidean | yes | 2.38 | 391837.88 |
| Euclidean | no | 0.50 | 172492 |

For members, both un-normalized KNN tests had improved (lower) average absolute standard Z values compared to the normalized tests.


**favorites** (Mean: 1610.34, St. Dev: 6211.04)

| Distance | Normalized | Avg Abs St. Z | Avg St. Feature Dev |
| --- | --- | --- | --- |
| Cosine | no | 0.94 | 6467.43 |
| Cosine | yes | 1.90 | 11806.14 |
| Euclidean | yes | 1.92 | 11906.55 |
| Euclidean | no | 1.08 | 7654.53 |

**One-Hot Feature Comparisons from EXAMPLE 2**

For this series of comparisons, the mean value of a one-hot feature represents the proportion of the data that has this feature. Some features have relatively high proportions, such as genre_Comedy, which has a mean value of 0.45 (or 44.86% of the data). In comparison, other features represent a very small percentage of the data, such as studio_Madhouse, which has a mean of 0.05 (5.49% of the data). Additionally, we use absolute average difference as a measure of how similar our results were to the input. It is calculated by:

|x̄ − μ|, where x̄ is the average feature value over the anime recommendations and μ is the average feature value over the inputs.
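This measure is straightforward to compute; a small sketch with hypothetical one-hot values:

```python
import numpy as np

def abs_avg_diff(recommended_vals, input_vals):
    """Absolute average difference |x_bar - mu|: x_bar is the mean feature
    value over the recommendations, mu the mean over the inputs."""
    return abs(float(np.mean(recommended_vals)) - float(np.mean(input_vals)))

# e.g. hypothetical genre one-hot values for 5 recommendations vs 3 inputs:
# x_bar = 0.6, mu = 1.0
print(abs_avg_diff([1, 1, 0, 1, 0], [1, 1, 1]))
```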

**genre_Action** (Mean: 0.40, St. Dev: 0.49)

| Distance | Normalized | Avg Abs St. Z | Avg Sq St. Dev | Abs Avg Diff |
| --- | --- | --- | --- | --- |
| Cosine | no | 1.75 | 0.86 | 0.86 |
| Cosine | yes | 1.46 | 0.77 | 0.66 |
| Euclidean | yes | 1.75 | 0.86 | 0.86 |
| Euclidean | no | 1.46 | 0.77 | 0.66 |

On average, the tests were about evenly matched, with Cosine normalized and Euclidean un-normalized performing slightly better than the other two. However, our inputs formed a concentrated set with moderate variance, so we expect some randomness in our test results.

**studio_Madhouse** (Mean: 0.05, St. Dev: 0.23)

| Distance | Normalized | Avg Abs St. Z | Avg Sq St. Dev | Abs Avg Diff |
| --- | --- | --- | --- | --- |
| Cosine | no | 0 | 0 | 0 |
| Cosine | yes | 0 | 0 | 0 |
| Euclidean | yes | 0 | 0 | 0 |
| Euclidean | no | 0 | 0 | 0 |

From our resulting variance measurements, we can see that for one-hot features with very low population representation (small probability), we cannot expect good measurements of how well our recommendations did relative to the input, as most anime fall outside this tiny portion of our data. This is especially exemplified by our measurements of the studio_Madhouse values for average absolute standard Z and average standard feature deviation; several times, both values were 0. However, this does not necessarily signify perfect recommendation results for this feature given the input anime. Instead, it tells us that we do not have enough values in our anime dataset to accurately measure our KNN performance with regard to the feature in question. Still, with regard to our results, we can say with relative confidence that because our set of input anime had a mean studio_Madhouse feature value of 0 (meaning none of the anime were created by Madhouse), we would expect our recommendations to return non-Madhouse anime.

As with the feature comparison trends, overall Cosine un-normalized KNN results prioritized high-valued quantitative features over small-valued features such as one-hot encoded features. In contrast, Cosine normalized KNN produced results that were heavily impacted by one-hot encoded data values, like our synopsis-encoded data. Seen below is an excerpt of our input synopses:

which heavily featured words like “recap” and “episode.” Interestingly, our resulting recommendations from Cosine normalized KNN were also based on these wordings (see below).

Similarly, our Euclidean normalized KNN results were also heavily based on our synopsis key words (as seen below):

In contrast, our Euclidean un-normalized results were heavily based on high-valued quantitative features such as members, and did not give results similar to our one-hot encoded features. We can conclude from these results that normalizing our data is imperative to giving equal emphasis to our one-hot features and quantitative features, but may introduce skew from normalizing high-valued quantitative features.

**EXAMPLE 3: From a relatively similar assortment of anime**

Using our personal anime knowledge and experience, we selected a set of anime that were relatively closely associated for our KNN model testing. Below are our KNN model recommendations:

| | Cosine Unaltered | Cosine Normalized | Euclidean Unaltered | Euclidean Normalized |
| --- | --- | --- | --- | --- |
| STD Input Distance | 1.73e-03 | 0.29 | 1149911.69 | 20.27 |
| Distances | Fullmetal Alchemist: 7.40e-06<br>Future Diary: 9.45e-06<br>Elfen Lied: 9.74e-06<br>Parasyte: 2.14e-05<br>My Teen Romantic Comedy: 2.59e-05 | Fullmetal Alchemist: Brotherhood: 0.50<br>My Hero Academia: 0.51<br>Code Geass: Lelouch: 0.52<br>Death Note: 0.52<br>Code Geass: Lelouch R2: 0.52 | Ouran High School Host Club: 8961.68<br>Maid-Sama!: 13454.21<br>My Teen Romantic Comedy: 15365.79<br>Princess Mononoke: 18975.94<br>Overlord: 19197.70 | JoJo’s Bizarre Adventure: Diamond is Unbreakable: 12.12<br>Re:CREATORS: 12.39<br>Akame ga Kill!: 12.40<br>Drifters: 12.47<br>JoJo’s Bizarre Adventure: Stardust Crusaders: 12.76 |
| AVG Distances | 1.47e-05 | 0.52 | 15191.06 | 12.43 |

**One-Hot Feature Comparisons from EXAMPLE 3**

**genre_Comedy** (Mean: 0.45, St. Dev: 0.50)

| Distance | Normalized | Avg Abs St. Z | Avg Sq St. Dev | Abs Avg Diff |
| --- | --- | --- | --- | --- |
| Cosine | no | 1.11 | 0.60 | 0.35 |
| Cosine | yes | 1.11 | 0.60 | 0.35 |
| Euclidean | yes | 1.11 | 0.60 | 0.35 |
| Euclidean | no | 0.90 | 0.51 | 0.15 |

From the above feature comparison, we can see that of the KNN tests, Euclidean un-normalized had the worst performance. This is most likely because the Euclidean formula considers the weights and magnitudes of the vectors being compared, and because this test was also not normalized, the final outcome was biased towards the largest quantitative features, such as members or favorites (for which, as the earlier favorites and members feature comparison tables show, Euclidean un-normalized had the highest accuracy of all our KNN tests).

**EXAMPLE 4: From different anime genres**

We selected the anime in this test by choosing 5 random genres (Horror, Ecchi, Comedy, Magic, and Romance) and then choosing an anime from the top of the list for the corresponding genre. We then put this set of anime into our KNN model. The results were as follows:

| | Cosine Unaltered | Cosine Normalized | Euclidean Unaltered | Euclidean Normalized |
| --- | --- | --- | --- | --- |
| STD Input Distance | 8.94e-05 | 0.54 | 50161.62 | 12.04 |
| Distances | anohana: 1.02e-05<br>Parasyte: 1.36e-05<br>Elfen Lied: 1.36e-05<br>Future Diary: 2.93e-05<br>Vampire Knight: 3.67e-05 | Naruto: Shippuden: 0.51<br>Bleach: 0.53<br>Dragon Ball Z: 0.54<br>Tokyo Ghoul √A: 0.59<br>Reborn!: 0.59 | Haikyu!! 2: 6305.24<br>Nisemonogatari: 10319.20<br>School Days: 12258.90<br>Wolf Children: 12704.43<br>Kuroko’s Basketball 2: 12971.85 | JoJo’s Bizarre Adventure: Stardust Crusaders: 11.15<br>Drifters: 11.24<br>JoJo’s Bizarre Adventure: 11.54<br>Evangelion 3.0: 11.63<br>Re:CREATORS: 11.68 |
| AVG Distances | 2.38e-05 | 0.55 | 10911.93 | 11.45 |

**One-Hot Feature Comparisons from EXAMPLE 4**

**genre_Drama** (Mean: 0.27, St. Dev: 0.44)

| Distance | Normalized | Avg Abs St. Z | Avg Sq St. Dev | Abs Avg Diff |
| --- | --- | --- | --- | --- |
| Cosine | no | 1.81 | 0.89 | 0.80 |
| Cosine | yes | 0.45 | 0.45 | 0.20 |
| Euclidean | yes | 0.91 | 0.63 | 0.40 |
| Euclidean | no | 0.91 | 0.63 | 0.40 |

From the genre_Drama feature table above, we can see that Cosine normalized performed the best, with the lowest absolute average difference of 0.20, while Cosine un-normalized had the worst performance, with an absolute average difference of 0.80.

**Comparative Quantitative Feature Comparisons, EXAMPLES 3 and 4**

| Feature | Distance | Normalized | EXAMPLE 3 (similar inputs): Avg Abs St. Z | EXAMPLE 4 (different inputs): Avg Abs St. Z |
| --- | --- | --- | --- | --- |
| members | Cosine | no | 1.76 | 2.33 |
| | Cosine | yes | 3.63 | 1.96 |
| | Euclidean | yes | 1.91 | 0.03 |
| | Euclidean | no | 0.04 | 1.15 |
| favorites | Cosine | no | 0.87 | 1.45 |
| | Cosine | yes | 9.01 | 2.81 |
| | Euclidean | yes | 1.46 | 1.03 |
| | Euclidean | no | 0.07 | 1.44 |

From the above table, we can see that for a similar input set (Example 3), both un-normalized KNN tests perform more accurately for large quantitative features. From the Example 4 results, we can see that regardless of input set variability, Cosine normalized performs the worst for large quantitative features, which we expect, as it both disregards the magnitude and weight of vectors (cosine distance) and has been rebalanced (normalization) so that quantitative features and one-hot features are weighted more evenly. Note that we cannot decisively conclude from our Example 4 results that un-normalized KNN tests always perform more accurately for large quantitative features; the Example 4 results may be affected by the variability of the input anime.

**Comparative One-Hot Feature Comparisons, EXAMPLES 3 and 4**

| Feature | Distance | Normalized | EXAMPLE 3 (similar inputs): Avg Abs St. Z | EXAMPLE 3: Abs Avg Diff | EXAMPLE 4 (different inputs): Avg Sq St. Dev | EXAMPLE 4: Abs Avg Diff |
| --- | --- | --- | --- | --- | --- | --- |
| genre_Comedy | Cosine | no | 1.11 | 0.35 | 1.61 | 0.80 |
| | Cosine | yes | 1.11 | 0.35 | 0.64 | 0 |
| | Euclidean | yes | 1.11 | 0.35 | 0.88 | 0.20 |
| | Euclidean | no | 0.90 | 0.15 | 1.37 | 0.60 |

From the above one-hot comparison, we can see that Cosine normalized performs better given more input variability. This can be tied back to our Example 2 results: the text-encoded values there were concentrated around specific words (specifically “recap” and “episode”), and the resulting recommendations were skewed toward anime with similar synopses rather than giving even coverage of all one-hot and quantitative features. In contrast, when there is more input variability, there is less chance of biasing the Cosine normalized output toward a specific encoded feature. For a similar set of inputs, we can see that Euclidean un-normalized performs slightly better than the other tests, though we cannot definitively say whether this is due to randomness. Additionally, we can identify that for both similar and different input anime sets, Cosine un-normalized performed the worst for one-hot encoded features.

**Overall Analysis**

From our results, we can see that for our dataset, on average, Euclidean un-normalized KNN performed the weakest (highest average output distance). This is likely due to the range of values in our dataset. We processed our categorical data into one-hot encodings while retaining the raw quantitative values, and in comparison, the range and variation of the quantitative values are very high. For example, the quantitative feature scored_by has a range from 8 to 1107955, a mean of 51396.65, and a standard deviation of 96648.63. Without normalization, Euclidean distance, which accounts for the magnitudes of vectors as well as the angle between them, is skewed toward high-valued features such as scored_by. In contrast, Cosine un-normalized KNN did a better job of considering quantitative data features.

However, to properly take in our NLP-derived synopsis data, we should use normalized KNN for better results. This accuracy improves when the input anime have closely overlapping or related words. For instance, in our Example 3 Cosine normalized KNN test, the input anime synopses shared words like “human”, “hero”, “villain”, “criminal”, “fight”, and “school”, and the corresponding recommendations also featured related words such as “human”, “killer”, “hero”, “school”, “criminal”, and “vigilante”. However, this has its own downfalls: as quantitative values and one-hot encoded data are normalized to even out their weights, recommendations become heavily dependent on the one-hot data. For example, in EXAMPLE 2, specifically the Cosine normalized KNN test, the input anime series (Attack on Titan) had many unrelated but repeating words, such as “recap”, “rewrite”, “episode”, and “humanity”, and especially contained the phrase “recap of episodes”. Likewise, the synopses of the output anime contained this phrase or a similar variant, and the recommendations depended more on this particular synopsis wording than on other features. In contrast, Cosine un-normalized consistently performed the worst for one-hot encoded features.

Additionally, we found that for very different input anime, as in our Example 4 test, the KNN recommendations had higher variance on average, with normalized KNN results having higher variance than un-normalized.

## Conclusion

Though this approach yielded interesting results, there are some aspects that could be improved. For instance, our current dataset separates different anime within the same series; it could therefore recommend, to a user who inputs an anime from a series, another anime within the same series. This is obviously not an ideal outcome, because avid anime watchers would likely not get anything meaningful out of such a recommendation. Rather, we want to be able to introduce people to new anime that they otherwise might not have known about. One way to address this issue is to compress all of the anime in a series down to one row, which would completely eliminate the possibility of these types of results. We could also introduce random noise or uncertainty, not only to mitigate this problem but also so that the results are more likely to be new and interesting to users.

### References

Ellis, Theo J. “How the Anime Industry Has Grown Since 2004, According to lớn Google Trends.” *Anime Motivation*, animemotivation.com, 23 June 2018, https://animemotivation.com/anime-industry-growth-2004-to-2018/.


“Full-Text Stopwords.” *MySQL*, Oracle Corporation, https://dev.mysql.com/doc/refman/8.0/en/fulltext-stopwords.html.
