Demoji

Accurately find/replace/remove emojis in text strings

demoji

Accurately find or remove emojis from a blob of text using data from the Unicode Consortium's emoji code repository.


Major Changes in Version 1.x

Version 1.x of demoji bundles Unicode data in the package at install time rather than downloading the codes from unicode.org at runtime. Please see CHANGELOG.md for details, and familiarize yourself with the changes before updating from 0.x to 1.x.

To report any regressions, please open a GitHub issue.

Basic Usage

demoji exports several functions for finding and replacing emojis in text:

>>> import demoji
>>> tweet = """\
... #startspreadingthenews yankees win great start by 🎅🏾 going 5strong innings with 5k’s🔥 🐂
... solo homerun 🌋🌋 with 2 solo homeruns and👹 3run homerun… 🤡 🚣🏼 👨🏽‍⚖️ with rbi’s … 🔥🔥
... 🇲🇽 and 🇳🇮 to close the game🔥🔥!!!….
... WHAT A GAME!!..
... """
>>> demoji.findall(tweet)
{
    "🔥": "fire",
    "🌋": "volcano",
    "👨🏽\u200d⚖️": "man judge: medium skin tone",
    "🎅🏾": "Santa Claus: medium-dark skin tone",
    "🇲🇽": "flag: Mexico",
    "👹": "ogre",
    "🤡": "clown face",
    "🇳🇮": "flag: Nicaragua",
    "🚣🏼": "person rowing boat: medium-light skin tone",
    "🐂": "ox",
}

See below for function API.

Command-line Use

You can use demoji or python -m demoji to replace emojis in file(s) or stdin with their :code: equivalents:

$ cat out.txt
All done! ✨ 🍰 ✨
$ demoji out.txt
All done! :sparkles: :shortcake: :sparkles:

$ echo 'All done! ✨ 🍰 ✨' | demoji
All done! :sparkles: :shortcake: :sparkles:

$ demoji -
we didnt start the 🔥
we didnt start the :fire:

Reference

findall(string: str) -> Dict[str, str]

Find emojis within string. Return a mapping of {emoji: description}.

findall_list(string: str, desc: bool = True) -> List[str]

Find emojis within string. Return a list (with possible duplicates).

If desc is True, the list contains description codes. If desc is False, the list contains emojis.

replace(string: str, repl: str = "") -> str

Replace emojis in string with repl.

replace_with_desc(string: str, sep: str = ":") -> str

Replace emojis in string with their description codes. The codes are surrounded by sep.

last_downloaded_timestamp() -> datetime.datetime

Show the timestamp of last download for the emoji data bundled with the package.

Footnote: Emoji Sequences

Numerous emojis that look like single Unicode characters are actually multi-character sequences. Examples:

  • The keycap 2️⃣ is actually 3 characters: U+0032 (the ASCII digit 2), U+FE0F (variation selector), and U+20E3 (combining enclosing keycap).
  • The flag of Scotland has 7 component characters, b'\\U0001f3f4\\U000e0067\\U000e0062\\U000e0073\\U000e0063\\U000e0074\\U000e007f' in full escaped notation.

(You can see any of these through s.encode("unicode-escape").)
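
As a small illustration of the two examples above, the following sketch builds both sequences from explicit code points (so it does not depend on how an editor or terminal renders them) and inspects them using only the standard library. The variable names are ours and are not part of demoji.

```python
import unicodedata

# Keycap "2": ASCII digit, variation selector-16, combining enclosing keycap.
keycap_two = "\u0032\ufe0f\u20e3"

# Flag of Scotland: a black flag followed by six tag characters
# (g, b, s, c, t, and the cancel tag).
scotland = "\U0001f3f4\U000e0067\U000e0062\U000e0073\U000e0063\U000e0074\U000e007f"

# len() counts code points, not what a reader perceives as one emoji.
for label, s in [("keycap 2", keycap_two), ("flag of Scotland", scotland)]:
    print(label, len(s), s.encode("unicode-escape"))

# Official Unicode names of the keycap's three components.
keycap_names = [unicodedata.name(ch) for ch in keycap_two]
print(keycap_names)
```

Each string renders as a single glyph on most platforms, yet len() reports 3 and 7 respectively.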

demoji is careful to handle this and should find the full sequences rather than their incomplete subcomponents.

It does this by sorting emoji codes by length and then compiling a concatenated regular expression that greedily searches for longer emojis first, falling back to shorter ones if none match. This is by no means a highly optimized way of searching, as it has O(N^2) properties, but the focus is on accuracy and completeness.

>>> from pprint import pprint
>>> seq = """\
... I bet you didn't know that 🙋, 🙋‍♂️, and 🙋‍♀️ are three different emojis.
... """
>>> pprint(seq.encode('unicode-escape'))  # Python 3
(b"I bet you didn't know that \\U0001f64b, \\U0001f64b\\u200d\\u2642\\ufe0f,"
 b' and \\U0001f64b\\u200d\\u2640\\ufe0f are three different emojis.\\n')
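
The length-sorted search described in this footnote can be sketched roughly as follows. This is a minimal illustration using the standard library's re module, not demoji's actual implementation: EMOJI_TABLE is a hypothetical three-entry stand-in for the full Unicode data bundled with the package, and its descriptions are included only for illustration.

```python
import re

# Hypothetical three-entry table; the real package bundles the full Unicode list.
EMOJI_TABLE = {
    "\U0001f64b": "person raising hand",
    "\U0001f64b\u200d\u2642\ufe0f": "man raising hand",
    "\U0001f64b\u200d\u2640\ufe0f": "woman raising hand",
}

text = "\U0001f64b, \U0001f64b\u200d\u2642\ufe0f, and \U0001f64b\u200d\u2640\ufe0f"

# Longest sequences first: regex alternation tries branches left to right,
# so a full sequence is matched before its own shorter prefix.
by_length = sorted(EMOJI_TABLE, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(seq) for seq in by_length))
found = [EMOJI_TABLE[m.group(0)] for m in pattern.finditer(text)]
print(found)

# For contrast, shortest-first ordering only ever sees the base emoji.
shortest_first = sorted(EMOJI_TABLE, key=len)
naive = re.compile("|".join(re.escape(seq) for seq in shortest_first))
naive_found = [EMOJI_TABLE[m.group(0)] for m in naive.finditer(text)]
print(naive_found)
```

With the longest-first pattern the three emojis are reported separately. With the shortest-first pattern all three collapse into the base emoji, which is the incomplete-subcomponent problem the sorting is meant to avoid.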
README Source: bsolomon1124/demoji